Monday, 26 November 2007

Implementing Fraud Detection using Bayesian methods in Data Sets with Benford's Law

The journal “Communications in Statistics: Simulation and Computation”, published by Taylor & Francis, featured the article “Detecting Fraud in Data Sets Using Benford's Law” in Volume 33, Number 1 / 2004 (pages 229–246). This article by Christina Lynn Geyer and Patricia Pepple Williamson looks at the use of Bayesian methods rather than the Distortion Factor (DF) model, which is generally used to detect fraud in financial data.

The paper proposes a Bayesian alternative which the authors state “outperforms the DF model for any reasonable significance level. Similarly, the Bayesian approach proposed as an alternative to the classical chi-square goodness-of-fit test outperforms the chi-square test for reasonable significance levels”.[1]

The purpose of the original project that spawned this post was to write an R function that implements this approach to using Benford’s law. The input will be a set of financial data. The expected output will be a statistical likelihood of fraudulent transactions being present in the data.
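The comparison at the heart of any such function can be sketched briefly. Benford's law gives the expected proportion of leading digit d as log10(1 + 1/d). The sketch below is written in Python rather than R purely for illustration, and the function names are hypothetical; none of it is taken from the original project.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_first_digit_probs():
    """Expected leading-digit proportions under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading significant digit of a non-zero number.

    Going through str() and Decimal avoids binary floating-point
    surprises (for example 0.3 being stored as 0.29999...).
    """
    digits = Decimal(str(abs(x))).as_tuple().digits
    if not any(digits):
        raise ValueError("zero has no leading digit")
    return digits[0]


def first_digit_proportions(values):
    """Observed proportion of each leading digit 1..9, ignoring zeros."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no non-zero values supplied")
    return {d: counts.get(d, 0) / n for d in range(1, 10)}
```

Calling first_digit_proportions on a column of transaction amounts and setting the result beside benford_first_digit_probs() gives the observed and expected proportions that any of the tests discussed here would compare.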

The aim was to provide an alternative approach to analysing data to the Distortion Factor (DF) model, which was developed by Mark Nigrini and first appeared in Nigrini (1996). The DF model makes two assumptions:

  • “That people do not manipulate data outside of the original magnitude in other words, a person is more likely to change a 10 to a 12 than change a 10 to a 100.”[2],
  • And that “percentage of manipulation is approximately equal across the magnitudes. This means that someone may change a 50 to a 55 or a 500 to a 550, but would probably not change a 500 to a 505.”[3]

Geyer and Williamson (2004) propose a Bayesian approach as an alternative to the DF model first proposed by Nigrini. They demonstrate that this process is more efficient for any reasonable significance level. They further note that, although there is generally little value in comparing hypotheses tested using different approaches, the DF and Bayes methods express the likelihood of finding fraudulent data through different calculations and so may be compared for validity. Their results conclude that the Bayesian model is a valid alternative to Nigrini’s DF model.

The process is of great interest in tax accounting, financial audit and forensic data analysis. As data in a company’s financial reports should conform to Benford’s law if truthfully reported, nonconformity will raise a level of distrust even if the data is valid.

The paper discusses the existing methods used to implement a data analysis using Benford’s law and compares these with two Bayesian alternatives including the one proposed by the authors and another by Ley (1996).
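One of the existing methods is the classical chi-square goodness-of-fit test on the first-digit counts. A minimal sketch of that statistic follows, in Python for illustration only; the function name is hypothetical, and the fixed critical value assumes a 5% significance level with 8 degrees of freedom.

```python
import math

# 95th percentile of the chi-square distribution with 8 degrees of freedom
# (nine digit cells minus one). Assumes a 5% significance level.
CHI2_CRITICAL_8DF_5PCT = 15.507


def chi_square_first_digit(counts):
    """Chi-square statistic for nine observed leading-digit counts.

    counts[0] is the number of values starting with 1, counts[8] the
    number starting with 9.
    """
    if len(counts) != 9:
        raise ValueError("expected nine counts, one per leading digit")
    n = sum(counts)
    if n == 0:
        raise ValueError("counts must not all be zero")
    statistic = 0.0
    for digit, observed in enumerate(counts, start=1):
        expected = n * math.log10(1 + 1 / digit)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def rejects_benford(counts):
    """True when the statistic exceeds the 5% critical value."""
    return chi_square_first_digit(counts) > CHI2_CRITICAL_8DF_5PCT
```

A statistic above the critical value says only that the digits do not follow Benford's law; it does not by itself establish fraud.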

The authors demonstrate, using a variety of data sets, that the Bayesian approach is valid and gives the same results as the DF method. More importantly, they note that it is also more efficient. This is of value in the accounting and audit sectors: the improved performance makes ongoing data analysis more practical, and the ability to automate the review of data to detect fraud on an ongoing basis makes this process highly valuable to business.

Algorithms
This algorithm details the method used in the calculation of the Bayesian number proposed by Geyer and Williamson (2004). The alternative (and the original) method called the Distortion Factor (DF) developed by Nigrini (2000) is included for comparison.

  • β0 is the Bayes Factor for the relative likelihood of H0 to H1 provided solely by the data as defined by Geyer and Williamson (2004)
  • β1 is the alternate Bayes Factor as defined by Geyer and Williamson (2004) in section 4.1 of their paper.
  • θ0 is the mean of a Benford set scaled to the interval [10, 100]
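To illustrate what a Bayes factor of H0 to H1 measures, here is a generic sketch on first-digit counts. It is not the β0 or β1 of Geyer and Williamson (2004), whose exact definitions are in their paper. It assumes H0 is "the digit probabilities are exactly Benford's" and H1 places a uniform Dirichlet prior over all possible digit probabilities. The function name is hypothetical and Python is used for illustration only.

```python
import math


def log_bayes_factor_benford(counts):
    """Natural log of the Bayes factor of H0 (Benford) to H1 (uniform prior).

    Under H0 the counts are multinomial with Benford probabilities.
    Under H1 the probabilities have a Dirichlet(1, ..., 1) prior, so the
    marginal likelihood has a closed form. The multinomial coefficient
    is common to both and cancels in the ratio.
    """
    if len(counts) != 9:
        raise ValueError("expected nine counts, one per leading digit")
    k = 9
    n = sum(counts)
    log_h0 = sum(
        c * math.log(math.log10(1 + 1 / d))
        for d, c in enumerate(counts, start=1)
    )
    log_h1 = (
        math.lgamma(k)
        + sum(math.lgamma(c + 1) for c in counts)
        - math.lgamma(n + k)
    )
    return log_h0 - log_h1
```

A positive value means the counts favour Benford's law over the vague alternative; a large negative value means the data sit far from it.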

  • AM is the Actual Mean
  • EM is the Expected Mean
  • DF is the Distortion Factor
  • The Distortion Factor is defined by Nigrini (2000, p61) as a method of testing conformity of data to Benford’s law.
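The quantities AM, EM and DF can be sketched as follows, using the Distortion Factor formula as I understand it from Nigrini's description: each value is collapsed into [10, 100), AM is the mean of the collapsed values, EM = 90 / (n(10^(1/n) − 1)), and DF = (AM − EM) / EM. As n grows, EM tends to 90 / ln 10 ≈ 39.09, which is the θ0 listed above. The function names are hypothetical, Python is used for illustration only, and discarding values below 10 is an assumption of this sketch.

```python
import math


def collapse(x):
    """Move the decimal point so that x lands in [10, 100)."""
    scaled = x / 10 ** (math.floor(math.log10(x)) - 1)
    # Guard against rounding in log10 right next to a power of ten.
    if scaled < 10:
        scaled *= 10
    elif scaled >= 100:
        scaled /= 10
    return scaled


def expected_mean(n):
    """EM: mean of an n-point geometric (Benford) set on [10, 100)."""
    return 90 / (n * (10 ** (1 / n) - 1))


def distortion_factor(values):
    """DF = (AM - EM) / EM on the collapsed data.

    Assumption of this sketch: values below 10 are discarded first.
    """
    collapsed = [collapse(v) for v in values if v >= 10]
    n = len(collapsed)
    if n == 0:
        raise ValueError("no values of 10 or more supplied")
    actual_mean = sum(collapsed) / n
    em = expected_mean(n)
    return (actual_mean - em) / em
```

A DF near zero indicates conformity; a positive DF suggests values nudged upwards within their magnitude, and a negative DF the reverse.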

Using the census dataset demonstrates a real world example of Benford’s Law. This dataset is a set of US Census data as used by Nigrini (2000) and distributed by him. The other datasets are ones which come with the R package.

We can see that the biggest issue with these datasets is their size. It would be expected that increasing the size of these datasets would bring them more into line with the expectations of Benford’s law.

It must be further noted that the 2 digit tests require a far larger sample to conform to the law.
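The reason is easy to see from the expected proportions. The first-two-digits test has 90 cells (10 to 99) instead of 9, and the rarest cell is far less likely. The sketch below, in Python for illustration only, uses the common rule of thumb that every cell should have an expected count of at least 5. That rule and the function names are assumptions of this sketch, not something taken from the paper.

```python
import math


def benford_probs(first, last):
    """Benford probability of each leading digit string first..last."""
    return {d: math.log10(1 + 1 / d) for d in range(first, last + 1)}


def min_sample_size(probs, min_expected=5):
    """Smallest n giving every cell an expected count of min_expected."""
    return math.ceil(min_expected / min(probs.values()))


one_digit = benford_probs(1, 9)     # 9 cells
two_digit = benford_probs(10, 99)   # 90 cells

n_one = min_sample_size(one_digit)
n_two = min_sample_size(two_digit)
```

Under that rule of thumb the two-digit test needs roughly ten times as many observations as the first-digit test before its rarest cell is adequately populated.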

A known bad dataset based on Airtravel is demonstrated below.


These techniques provide a good starting point for fraud analysis, which is where I currently use these (and why I developed these techniques).

Where I plan to take this is traffic and anomaly analysis. The methods match well with traffic patterns. For instance, Loki has shown itself to create patterns of traffic in ICMP that are easily detected using 2 factor Benford's analysis. Eventually I hope to add these techniques into common use for IT Security.

  1. Geyer & Williamson, 2004, p. 245 (Taylor & Francis, 2004)
  2. Taylor & Francis 2000
  3. ibid

1 comment:

Craig S Wright said...

References

1. Casella, George & Berger, Roger L (2002) “Statistical Inference” Duxbury Advanced Series
2. Dobson, Annette J. (2002) “An Introduction to Generalized Linear Models” 2nd Ed. Chapman & Hall/CRC
3. Givens, Geof H. & Hoeting, Jennifer A. (2005) “Computational Statistics” Wiley
4. Geyer, Christina Lynn & Williamson, Patricia Pepple (2004) “Detecting Fraud in Data Sets Using Benford's Law”, Communications in Statistics: Simulation and Computation, Volume 33, Number 1 / 2004, pp 229-246
5. Ley, E. (1996). “On the peculiar distribution of the U.S. stock indexes’ digits”. Amer. Statist. 50:311–313.
6. Maindonald, John & Braun, John (2004) “Data Analysis and Graphics Using R, An example based approach” Cambridge University Press
7. Nigrini, Mark (1994). “Using digital frequencies to detect fraud”. White Paper April/May, pp. 3–6.
8. Nigrini, Mark (1996). “A taxpayer compliance application of Benford’s law”. J. Amer. Taxation Assoc. 18:72–91.
9. R Development Core Team (2006) “Writing R Extensions”, R Foundation for Statistical Computing, Vienna, Austria, version 2.3.1 edition
10. Rice, John A. (1999) “Mathematical Statistics and Data Analysis” Duxbury Press
11. Wright, Kevin (2006) “Benford’s Law, First Digit Plot function”, R-Wiki, R Project, 20th June 2006