How to catch a fraudster – using 'top cop' Benford and the power of maths

Yes, we've been hit over the head enough times with the phrase "big data" to be aware of its presence, even though we've been up to our armpits in streams of huge unstructured datasets for years. Those of you who are analysts or data scientists will have already picked up a set of tools that help you find hidden information …

COMMENTS

This topic is closed for new posts.

  1. Neil Barnes Silver badge

    Benford...

    Isn't this also known as Zipf's law - usually applied to the relative usages of words?

    1. Dave 126 Silver badge

      Re: Benford...

      You might be thinking of the use of PKZip to compare many works, X, from a known author (say, all of a student's past essays) with a piece of text of dubious authenticity, Y.

      You Zip X, and then Zip X+Y. If the second Zip file is a fair bit larger than the first, you might be dealing with a case of plagiarism.
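
A minimal sketch of the compression comparison described above. It uses Python's `zlib` rather than PKZip itself (an assumption; both use DEFLATE-style compression), and the function names and sample strings are hypothetical, chosen only for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_compressed_bytes(known: str, candidate: str) -> int:
    """Extra bytes needed to compress candidate after `known` has been seen.

    A small value suggests the candidate reuses words and phrasing already
    present in `known`; a large value suggests it does not.
    """
    return compressed_size(known + candidate) - compressed_size(known)


if __name__ == "__main__":
    # Hypothetical sample data: X is the known body of work, Y the suspect text.
    known = "the quick brown fox jumps over the lazy dog. " * 50
    similar = "the lazy dog jumps over the quick brown fox. "
    different = "Zymurgy quizzically vexed jabberwocks, flummoxing pixies! "
    print(extra_compressed_bytes(known, similar))
    print(extra_compressed_bytes(known, different))
```

The absolute numbers mean little on their own; the comparison only makes sense relative to other candidate texts of similar length, and zlib only looks back 32 KB, so a very large X would need a different compressor.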

      1. tomsk

        Re: Benford...

        No, Zipf's law really is a thing - it's to do with the power law distributions you find in various real-world phenomena, one of the best-known of which is indeed the frequency with which particular words appear in any given body of text in a natural language.
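
For anyone who wants to check the word-frequency claim on a text of their own, here is a small sketch. The function name `rank_frequency` and the tokenising regex are assumptions made for this example. Under Zipf's law the count falls off roughly as 1/rank, so rank × count should stay in the same ballpark down the list.

```python
from __future__ import annotations

import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, count) tuples, most frequent word first."""
    # Crude tokeniser: runs of letters and apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, n) for rank, (word, n) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    sample = "a a a a b b c"  # toy input; use a real book for a meaningful test
    for rank, word, n in rank_frequency(sample):
        print(rank, word, n, rank * n)
```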

        1. Anonymous Coward

          Re: Benford...

          I was more curious what this has to do with unstructured data. An order amount would be stored in a normal RDBMS, and most of those support calculated indexes, so it would be fairly easy to query. An even better solution would be to maintain a separate table with these values in it.
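
A sketch of the calculated-index idea, using SQLite through Python's `sqlite3` module purely because it is self-contained. The `orders` table, its `amount_pence` column and the index name are hypothetical, amounts are assumed to be positive whole pence, and other databases spell expression indexes differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_pence INTEGER NOT NULL)"
)
# Index on an expression: the first character of the amount written as text.
# SQLite supports indexes on expressions from version 3.9 onwards.
conn.execute(
    "CREATE INDEX idx_orders_lead "
    "ON orders (substr(CAST(amount_pence AS TEXT), 1, 1))"
)
conn.executemany(
    "INSERT INTO orders (amount_pence) VALUES (?)",
    [(1250,), (1999,), (305,), (87,), (14000,), (2599,)],
)
# Leading-digit histogram, ready to compare against Benford's expected shares.
rows = conn.execute(
    "SELECT substr(CAST(amount_pence AS TEXT), 1, 1), COUNT(*) "
    "FROM orders GROUP BY 1 ORDER BY 1"
).fetchall()
print(rows)
```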

        2. Primus Secundus Tertius

          Re: Benford...

          @tomsk

          I came across another example of the Zipf Law.

          We were encoding digital maps, taking a square block of pixels and replacing it with a code number. This gave us a useful compression for those particular maps.

          The code numbers followed a Zipf distribution.

          1. Marvin the Martian

            Re: Benford...

            Zipf's Law is a generalization of Benford's (and older if I recall --- think Zipf based his work on Pareto / the Pareto Principle).

            Anyway, the whole "River" explanation seemed a bit overly long to me; a comparison closer to home so to speak is with the snaking "rivers" we live at: streets.

            So, why are more people living at house numbers beginning with 6 than with 7? Because every street with a no. 7 has a no. 6, with a number 70 has numbers 60--69, with a number 700 has numbers 600--699; but conversely there are some streets that have 60-something numbers (maybe not all!) but stop before reaching 79; and those that do reach 79 may end between 601 and 799. [In the UK the set of 1000+ numbered houses is negligible.] There are more streets that reach the 100s than the 200s, so:

            A street with houses 1--199 has more than 50% of people living at a number starting with 1, and about 5% each starting with 2, 3, .. 9. A street with houses 1--299 has over 33% in 1.., over 33% in 2.., and the rest equally split over the remaining 7. And so forth... There are more streets ending in the 300s than in the 400s, more ending in the 200s than the 300s, etc.

            As with rivers, the causes of street length (and house number density) are myriad unrelated factors, mostly geographical, political, etc... hence Zipf applies.
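
The street arithmetic above is easy to check by brute force. This sketch assumes every number from 1 to the last house exists exactly once (no gaps, no 12a), and the function name is made up for the example.

```python
from __future__ import annotations

from collections import Counter


def leading_digit_shares(last_number: int) -> dict[int, float]:
    """Share of house numbers 1..last_number that start with each digit 1-9."""
    counts = Counter(int(str(n)[0]) for n in range(1, last_number + 1))
    return {d: counts[d] / last_number for d in range(1, 10)}


if __name__ == "__main__":
    for last in (199, 299):
        shares = leading_digit_shares(last)
        print(last, {d: round(s, 3) for d, s in shares.items()})
```

For a street numbered 1--199 this gives 111/199 (about 56%) for a leading 1 and 11/199 (about 5.5%) for each other digit.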

  2. deadlockvictim

    More!

    Statistics wasn't as fun or as interesting as this when I was in college.

    More please!

    1. Carl Zetie

      Yes please Re: More!

      When I was taught statistics it was from the dull and dry perspective of "work out how likely something is to happen". It turns out that in real life, a much more interesting question is "here's what actually happened, work out how likely it is that it was random". Not only is that great for detecting fraud, it's also a basic tool of science; for example the discovery of the Higgs Boson basically boils down to "here's the decay events we recorded, how likely is it that we'd see that distribution of events without a Higgs Boson in the mix? Vanishingly small? OK, that's the discovery."

      More of the same, please.
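
One common way to put a number on "how likely is it that this was random" for leading digits is a chi-square goodness-of-fit test against Benford's expected counts. This is a sketch of that test, not necessarily the method the article has in mind; the function names are hypothetical and the critical value is the standard table figure for 8 degrees of freedom at the 5% level.

```python
from __future__ import annotations

import math

# Chi-square critical value, 8 degrees of freedom, 5% significance level.
CHI2_CRIT_DF8_5PCT = 15.507


def benford_expected(n: int) -> list[float]:
    """Expected count of each leading digit 1-9 among n Benford-distributed values."""
    return [n * math.log10(1 + 1 / d) for d in range(1, 10)]


def chi_square_vs_benford(observed: list[int]) -> float:
    """Chi-square statistic; observed[i] is the count for leading digit i + 1."""
    if len(observed) != 9:
        raise ValueError("need exactly nine counts, for digits 1 to 9")
    expected = benford_expected(sum(observed))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    counts = [111] * 9  # digits used equally often, as a naive faker might
    stat = chi_square_vs_benford(counts)
    print(stat, "suspicious" if stat > CHI2_CRIT_DF8_5PCT else "plausible")
```

A statistic above the critical value only says the digits are unlikely to be Benford-distributed; it does not say why. With very large samples even tiny, harmless deviations will cross the threshold.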

      1. Diogenes
        Thumb Up

        excellent

        Beautifully timed. Read this last night, and today replaced a maths teacher (as an extra he was supervising NAPLAN) and was teaching year 12 probability.

        Had me intrigued so I looked up the other methods

        Used this article as the intro

  3. Anonymous Coward
    Thumb Up

    Amazed!

    More of this sort of thing!

  4. Anonymous Coward
    Thumb Up

    Nice one

    Edumacational.

  5. Evil Auditor Silver badge

    How to deal with limits?

    I was trying to apply Benford's law to a payment system but I was struggling with limits. Each person working with the system has an individual limit, e.g. someone can enter payments up to £500, other persons up to £750 or £12,000 etc. And what's more, the payments as such may have limits too, e.g. for some sort of business the payments may usually go up to £2,500, for others to £9,000 etc.

    I did identify several abnormalities according to Benford but for now I gave up as it didn't seem feasible to dig into those by hand. Someone in here got an idea how to deal with that?

    Btw, good start and looking forward to reading the rest of this series.

    1. Eaten Trifles
      Pint

      Re: How to deal with limits?

      The thing about Benford's is that it applies to numbers that are logarithmically distributed, i.e. most real-world numbers. If you got a sheet of logarithmic graph paper and threw darts at it at random you'd get most darts landing in a "number begins with a 1" section, because those sections are wider than the others. Benford's distribution. It works no matter what units (scaling factor) you use because log distributions are scale invariant.
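
The dartboard picture can be simulated. In this sketch the "darts" are drawn uniformly on a log scale over six decades (an assumption standing in for the graph paper), the expected share for digit d is log10(1 + 1/d), and rescaling every value by a constant stands in for a change of units. All names are invented for the example.

```python
from __future__ import annotations

import math
import random


def benford_probability(d: int) -> float:
    """Expected share of values whose leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)


def leading_digit(x: float) -> int:
    """First significant digit of a non-zero number."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # Scientific notation puts the first significant digit first.
    return int(f"{abs(x):.15e}"[0])


def digit_shares(values) -> list[float]:
    """Observed share of each leading digit 1-9."""
    counts = [0] * 9
    total = 0
    for v in values:
        counts[leading_digit(v) - 1] += 1
        total += 1
    return [c / total for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    darts = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
    for scale in (1.0, 2.54):  # e.g. inches, then centimetres
        shares = digit_shares(v * scale for v in darts)
        print(scale, [round(s, 3) for s in shares])
    print("Benford", [round(benford_probability(d), 3) for d in range(1, 10)])
```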

      So here's my idea.... if you need to check using limits, get out your sheet of log graph paper again and mark off the limits of the data range. Scale the data to fit the paper if necessary. Then plot your data points. If the data is Benfordly-distributed, the points will appear to be randomly spread. If it isn't, the points will be clustered in the narrow-line sections of the graph and out of the wider-line sections.

      Beer, in case it doesn't work.
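
The graph-paper check can also be done without paper. Where a value lands within one cycle of log paper is the fractional part of log10(x), so for Benford-like data those fractional parts should be spread evenly between 0 and 1. A sketch, assuming all values are positive; the function name and the choice of ten slices are arbitrary.

```python
from __future__ import annotations

import math


def log_mantissa_histogram(values, bins: int = 10) -> list[int]:
    """Count values in each equal-width slice of (log10(x) mod 1)."""
    counts = [0] * bins
    for v in values:
        frac = math.log10(v) % 1.0  # position within one log cycle
        counts[min(int(frac * bins), bins - 1)] += 1
    return counts


if __name__ == "__main__":
    payments = [12.5, 130.0, 19.99, 2400.0, 310.0, 45.0, 7.2, 88.0, 950.0]
    print(log_mantissa_histogram(payments, bins=5))
```

Roughly flat counts suggest a Benford-like spread; slices that are conspicuously empty or overfull are the places to start looking. A hard upper limit on payments would be expected to leave its own gap in the top cycle, so it is worth running this per limit group rather than on everything at once.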

    2. Anonymous Coward

      Re: How to deal with limits?

      > I was trying to apply Benford's law to a payment system but

      > I was struggling with limits.

      Normalize each person's data to his individual limit?

      1. Loyal Commenter Silver badge

        Re: How to deal with limits?

        Normalize each person's data to his individual limit?

        Normalising the data by scaling the numbers by a fixed factor, however, makes the data fit better into a Benford-type distribution, diluting any effect you may be looking for.

        I'd suggest that the limit is irrelevant, unless you are expecting to always be reducing the value of payments to the value of the limit.

    3. Bob Wheeler
      Go

      Re: How to deal with limits?

      Read the link from the article to the Journal of Accounting, and it talks about checking that situation.

      1. Evil Auditor Silver badge
        Thumb Up

        Re: How to deal with limits?

        Thanks for your answers!

        Well, it seems like there's no easy solution - I'll have to invest more, probably much more time and get the hands dirty...

        1. Bob Wheeler

          Re: How to deal with limits? @ Evil Auditor

          I was thinking more about this last night - sad I know.

          Given I'm not an accountant, statistician or an auditor, evil or otherwise.

          Anyway, say you're looking at a set of transactions between 2500.00 and 5000.00. The smallest unit that can change is 1 cent/penny, i.e. after 2500.00 the next value is 2500.01.

          So you then multiply by 100, to get to whole numbers.

          Next, to look at the concept of limits: I thought that if multiplying is OK (i.e. does not disrupt the use of Benford's law) then why not divide? As the lowest limit is 250000, use that; but then I thought more, and that would leave us with everything in the range of 1 to 2, not very helpful.

          So why not subtraction? You can't use the lower limit, as potentially you get zero for some transactions.

          So how about subtracting 249999 from each transaction? This would leave everything in the range of 1 up to 250001.

          If you then only look at the leading digit, would/should it fit into Benford's Law?
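
The subtraction idea can be tried on a toy case. This sketch assumes, purely for illustration, that every whole-pence amount between the two limits occurs exactly once (real payments will not be spread like that); the function name is made up.

```python
from collections import Counter


def shifted_leading_digit_counts(low_pence: int, high_pence: int) -> Counter:
    """Leading-digit counts after subtracting (low_pence - 1) from every
    whole-pence amount in [low_pence, high_pence], one of each amount."""
    offset = low_pence - 1
    return Counter(
        int(str(p - offset)[0]) for p in range(low_pence, high_pence + 1)
    )


if __name__ == "__main__":
    counts = shifted_leading_digit_counts(250000, 500000)
    total = sum(counts.values())
    for d in range(1, 10):
        print(d, counts[d], round(counts[d] / total, 3))
```

Under that evenly-spread assumption the shifted values run from 1 to 250001 and about 44% of them start with 1, about 24% with 2 and about 4.4% with each other digit, against Benford's 30.1%, 17.6% and so on. So the subtraction on its own does not turn a bounded, evenly spread range into a Benford one; the result mostly reflects where the upper limit happens to fall.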

          1. Evil Auditor Silver badge

            Re: How to deal with limits? @ Evil Auditor

            Bob Wheeler, are you looking for a job at Eurostat? I'm pretty sure they do this kind of thing on a regular basis. ;-) Seriously though, I understand a bit of statistics but not enough to know what the implications of your suggestion would be. I have the feeling that it might mess up the numbers, but I'm going to discuss it with the maths genius that is hiding somewhere in the basement of my office building...

            1. Bob Wheeler
              Pint

              Re: How to deal with limits? @ Evil Auditor

              Umm, been thinking more about this, sad, oh so very sad.

              I'm now not so sure about the subtraction side of things and how that might affect the numbers.

              While you're just moving the trend line to the left, what happens when it crosses the boundary, moves left out of the '1' column, and therefore falls into the '9' column?

              I like basements, nice and snug, not too many visitors........

              Nurse, Is it time for my meds yet?

              1. Evil Auditor Silver badge
                Pint

                Re: How to deal with limits? @ Evil Auditor

                I had some similar thoughts, but what do I know. Unfortunately, I couldn't find the maths guy yet - it's Friday and on Fridays he's never been seen near the office.

  6. tfewster

    Fascinating, but I would still have expected 9 to have a higher incidence than 8 if the numbers are about money as, e.g., retailers price products at 9.99 rather than 10.00

    1. Patrick R

      Those retailers prices are not natural.

    2. Terry Barnes

      "Fascinating, but I would still have expected 9 to have a higher incidence than 8 if the numbers are about money as, e.g., retailers price products at 9.99 rather than 10.00"

      It doesn't actually work that way. Lots of products are priced at 9.99, but not as many as cost £8.xx and there aren't as many of those as things that cost £7.xx and so on.

    3. Yet Another Anonymous coward Silver badge

      > I would still have expected 9 to have a higher incidence than 8 if the numbers are about money

      It only applies to the incidence of the first digit.

    4. Loyal Commenter Silver badge

      If you buy two items priced at 5.99, 6.99, 7.99, 8.99 or 9.99, the total will still start with a 1.

    5. Cynic_999

      OK, so products are often sold for x9.99. I suspect 9.99 is not more common than 1.99 or 19.99. And what numbers come up in the final bill if you buy more than one item and VAT is added?

    6. tfewster
      Go

      Interesting responses, thank you, and they demonstrate even further how my intuition about approximate answers can be badly flawed...

      @ Patrick R "Those retailers prices are not natural"

      - Very good, on both levels :-) Yes, a bias has been introduced so they are "chosen numbers" rather than "random numbers" or "numbers that occur in nature"

      @ Terry Barnes "It doesn't actually work that way..."

      - Possibly, but I suspect retailers will also overprice a cheaper item up to the £10 "mental barrier" to compensate for those they have to underprice. (Ignoring competition etc.)

      @ YAAC "It only applies to the incidence of the first digit."

      - Yes, I wasn't counting the "insignificant" digits, just using them to demonstrate my theory about retailer pricing

      @ Loyal Commenter "If you buy two items priced..."

      - I notice you chose your set of numbers carefully, but yep, a "transform" breaks an even distribution. Transforms are back into the realm of pure maths, so we can get away from these damn statistics....

      @ Cynic_999 "I suspect 9.99 is not more common than 1.99..."

      http://www.googlefight.com/index.php?lang=en_GB&word1=9.99&word2=1.99

      OK, that's hardly rigorous and the results were closer than I expected! Interestingly, a googlefight between 9 and 8 (or any other 2 digits) seems to follow Benford's Law.

      As above, I appreciate that transforms break even distributions, but a couple of counter-points: In the UK, retail prices almost always include VAT (to avoid arguments at the till, if it isn't by law). And how about big-ticket items? If you're buying a budget car, you expect that £9,999 price tag to include everything. If you want a "nice" car, £19,999 is just the base model and you are prepared to pay for extras.

      1. Loyal Commenter Silver badge

        @ Loyal Commenter "If you buy two items priced..."

        - I notice you chose your set of numbers carefully, but yep, a "transform" breaks an even distribution. Transforms are back into the realm of pure maths, so we can get away from these damn statistics....

        For the sake of brevity, yes I did 'select' my numbers, to illustrate that the majority of prices end up starting with the digit 1 if doubled. A more complete example follows:

        If you have items priced at x.99, and your values of x are evenly distributed (i.e. non-random, as in prices in a shop), these are the prices you get if you buy a given number of them (up to nine; as we are working in base ten, going further will give the same results; proof of this is left as an exercise for the reader...)

        .99 1.98 2.97 3.96 4.95 5.94 6.93 7.92 8.91

        1.99 3.98 5.97 7.96 9.95 11.94 13.93 15.92 17.91

        2.99 5.98 8.97 11.96 14.95 17.94 20.93 23.92 26.91

        3.99 7.98 11.97 15.96 19.95 23.94 27.93 31.92 35.91

        4.99 9.98 14.97 19.96 24.95 29.94 34.93 39.92 44.91

        5.99 11.98 17.97 23.96 29.95 35.94 41.93 47.92 53.91

        6.99 13.98 20.97 27.96 34.95 41.94 48.93 55.92 62.91

        7.99 15.98 23.97 31.96 39.95 47.94 55.93 63.92 71.91

        8.99 17.98 26.97 35.96 44.95 53.94 62.93 71.92 80.91

        9.99 19.98 29.97 39.96 49.95 59.94 69.93 79.92 89.91

        If you then count the incidence of each first digit, you get:

        19 : 15 : 12 : 9 : 8 : 5 : 6 : 4: 3

        Which is an almost perfect Benford distribution. Voila!
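
The tally can be reproduced mechanically. This sketch works in whole pence to avoid floating-point rounding; the function name is hypothetical, and the prices are those of the grid above.

```python
from __future__ import annotations

from collections import Counter


def leading_digit_tally(unit_prices_pence, max_quantity: int = 9) -> list[int]:
    """Count leading digits 1-9 over price x quantity for quantities 1..max_quantity."""
    tally = Counter()
    for price in unit_prices_pence:
        for qty in range(1, max_quantity + 1):
            total = price * qty  # integer pence, so the first digit is exact
            tally[int(str(total)[0])] += 1
    return [tally[d] for d in range(1, 10)]


if __name__ == "__main__":
    print(leading_digit_tally(range(199, 1000, 100)))  # unit prices 1.99 to 9.99
    print(leading_digit_tally(range(99, 1000, 100)))   # also including 0.99
```

Over the unit prices 1.99 to 9.99 this gives 19, 15, 12, 9, 8, 5, 6, 4, 3, matching the counts quoted above. Including the 0.99 row as well (reading .99 as a leading 9) adds one to every digit, giving 20, 16, 13, 10, 9, 6, 7, 5, 4.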

  7. Androgynous Cupboard Silver badge

    I am still digesting the concept of a Readers Digest digest. Good article, again please.

    1. Yet Another Anonymous coward Silver badge

      >I am still digesting the concept of a Readers Digest digest.

      I only read the summary

      1. cordwainer 1

        re: I only read the summary

        I only read the Condensed summary.

        1. Ian 55

          Re: re: I only read the summary

          I only read the title.

          1. Sorry that handle is already taken. Silver badge

            Re: re: I only read the summary

            Err... I didn't?

            1. buyone

              Re: re: I only read the summary

              I only have the Executive Summary read to me.

              Naaaahhhhh Nah Nah

  8. Dr Paul Taylor

    Scale independence

    At this point you might be wondering if this is to do with the units in which you choose to measure, but no, this phenomenon is unit-independent.

    It's exactly because it is unit- or more accurately scale-independent that it works. I am a little surprised not to have seen the word "logarithm" in the article.

    Scale issues make me a little sceptical of its value for fraud detection: I would imagine that credit limits and common denominations of payments would obscure the distribution.

    1. Terry Barnes

      Re: Scale independence

      Surprisingly not. It's a pretty standard test applied where fraud might be an issue - business expenses, bank transactions etc...

      Have a look at your bank statement for last month. Check out how many numbers start with '1' compared to all the others.

    2. harmjschoonhoven

      Re: Scale independence

      Julian Havil explains Benford's Law in Ch. 14 "It's a Logarithmic World" of his excellent book "Gamma: Exploring Euler's Constant".

      BTW, Benford's Law also holds for binary numbers: expect the number of zeros to be double the number of ones. Try it.

      1. Caesarius
        Holmes

        Re: Scale independence, Binary numbers

        Er, sorry, no. The numbers analysed do not include leading zeros, so binary numbers all begin with one. But yes, that is what Benford's Law would predict, even if it is a trivial case :-)
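
The binary case drops out of the general formula: in base b, the expected share of leading digit d is log_b(1 + 1/d). A small sketch (the function name is invented for the example):

```python
import math


def benford_probability_in_base(d: int, base: int = 10) -> float:
    """Expected share of leading digit d (1 <= d < base) in the given base."""
    if not 1 <= d < base:
        raise ValueError("leading digit must satisfy 1 <= d < base")
    return math.log(1 + 1 / d, base)


if __name__ == "__main__":
    print(benford_probability_in_base(1, base=2))   # the only possible leading digit
    print(benford_probability_in_base(1, base=10))  # roughly 0.301
    print(benford_probability_in_base(1, base=16))  # roughly 0.25
```

In base 2 the only allowed leading digit is 1, and the formula gives log2(2) = 1, which is the trivial case described above.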

    3. T. F. M. Reader

      Re: Scale independence

      @Dr Paul Taylor: I am a little surprised not to have seen the word "logarithm" in the article.

      It is actually there if you look closely. ;-) [Hint: in the description of where Frank Benford started from.]

  9. Patrick R

    Scale

    In other words, you can have a much bigger scale of values when beginning with 1 than with 9.

    Example [1 to 1.999] and [9 to 9.999] ...feel the restriction?

  10. Arthur the cat Silver badge
    WTF?

    The R link is daft

    R the programming language is very good for statistics hacking, but the "for example R" link points to an article on IBM's System R which was a relational database. Both data related, but not the same thing.

  11. Michael H.F. Wilkinson Silver badge
    Thumb Up

    Very nice read

    We use Benford's Law in some assignments. I'll point students towards this nice informal introduction. Looking forward to the rest of the articles

  12. Anonymous IV

    House numbers

    Has anyone done any work on the distribution of digits of house numbers?

    Assume that each house number in a street is made up from individual numbers, so 13 has a 1 and a 3 screwed next to each other on its gatepost.

    Since streets aren't usually long enough for house numbers to go all the way up to 99 (or further!) there is going to be a bias of the distribution of the first digit towards 1 (since we don't usually number houses as 01, 02, etc) and the distribution of the second digit towards 0, since some streets will miss out on 25, 26, 27, 28 and 29, say.

    What will the distribution be when multiple streets in a town are combined together? Knowing how many metal plates with zeros, ones, twos, etc, to produce must be well-known to the manufacturer of house numbers!

    Not quite a Benford's Law distribution, though...

    1. disgruntled yank

      Re: House numbers

      A common US practice is to number outward, 100 per block, from a central point. In Washington, DC, that is the Capitol; in Denver it is (I think) Broadway and Exposition Sts. The applicability of Benford's law depends on how far this is carried. Within Washington, DC, the law does not apply. In Montgomery County, Maryland, which adjoins DC on the north, the numbering system continues, and for the north-south addresses Benford's law kicks in about the Beltway.

      The US government has made a big push to convert the old rural route delivery services to a common pattern of the "numbering outward" type, which must be increasing the reach of Benford's law here.

      1. Michael Wojcik Silver badge

        Re: House numbers

        A common US practice is to number outward, 100 per block, from a central point.

        In many places in the US, house numbers don't increment by a constant value, either. My house number ends in 19; the next-door neighbor's house number ends in 11. I'm not sure how they were assigned (possibly something to do with the platting, since house lots here do not correspond to surveyed parcels; my lot's legal description is 78 words long).

        It's also not uncommon to have house numbers up into the five digits in the US.

        In Japan, on the other hand, house numbers are assigned randomly, as part of a long-running prank being played against visitors.

        1. Anonymous Coward

          Re: House numbers

          > I'm not sure how they were assigned

          As I recall, and simplifying: allot 100 to each block, start from 1, take the length of the first façade and divide by the length of the block--that's the number of the second house, then take the length of the second façade, divide by block length and add to the second house's number: assign this to the third house and so on.

  13. 2460 Something
    Thumb Up

    Brilliant Stuff. Next article on an introduction to the simplicity of complex algebraic equations? :)

  14. disgruntled yank

    Quibbles

    " Below is some real data collected by Benford himself in 1938 from a copy of Reader’s Digest."

    No, below is a bar chart showing we know not what. The digits of numbers found in the text? The digits of the page numbers?

    1. phuzz Silver badge
      Facepalm

      Re: Quibbles

      (re-)Read the sentence before the one you quoted. The graph shows the frequency distribution of the leading digit of the numbers found in the magazine.
