Benford's law

Benford's law is usually explained in terms of the first digits of numbers. If you look up the heights of the tallest hundred mountains in the world, the land areas covered by the world's assorted nations, or any number of other sources of data, and consider only the first (non-zero) digit of each number, you will get more ones than twos, more twos than threes and so on. Back in the old days when computation was done using tables of logarithms, it was widely observed that a book of such tables was more heavily thumbed (i.e. the edges of the pages were visibly dirtier) towards the front – i.e. low numbers – than towards the back.

Obviously, the law doesn't hold for arbitrary data sets; if we look at the heights, measured in feet, of human adults we get a distribution of first digits in which 1 (and probably 2) totally fails to show up; you maybe get a few at 3, some at 4, lots at 5, a fair number at 6 and barely any beyond that. If we measure the same data in metres, 1 accounts for nearly all of the answers, 2 for a very few and if you get any other digits at all they'll be from the tiny proportion of adults under one metre in height. This is because the distribution of values has mean (somewhere around a metre and a half) significantly larger than its standard deviation (somewhere around ten to twenty centimetres). Benford's law only really applies to data whose variations are large enough, compared to their average, to straddle at least the range from some value to 10 times that value. Furthermore, it's only really meaningful where zero isn't a candidate value; and it ignores sign, effectively coercing the data to positive values.
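
To make the feet/metres example concrete, here is a minimal sketch (in Python; the normal-distribution parameters are rough stand-ins of my own, not survey figures) showing how a data set whose spread is small compared to its mean fails to follow Benford's law:

```python
import random

def first_digit(x):
    """First non-zero digit of a positive number, ignoring the decimal point."""
    return int(f"{x:e}"[0])  # scientific notation puts that digit first

# Simulated adult heights in metres: mean far larger than standard deviation.
heights = [random.gauss(1.65, 0.12) for _ in range(100_000)]
counts = {d: 0 for d in range(1, 10)}
for h in heights:
    if h > 0:  # guard against (vanishingly rare) non-positive samples
        counts[first_digit(h)] += 1
print(counts)  # overwhelmingly 1, some 2, a few 9 (heights just under a metre)
```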

One can express any positive value as the result of multiplying a power of 10 by a number which is ≥1 but <10; this latter number is known as the mantissa of the value, while the power to which 10 was raised is known as the exponent. (I should also note that I'm taking care to write this using 10 (that is, one zero), which denotes whatever number base you're using, without presuming that you are using base ten; any whole number >1 will serve as base for the present discussion, albeit the mantissa depends on your choice of base). What Benford's law is really describing is the distribution of mantissas – tacitly ignoring the exponent, i.e. reducing our values modulo powers of 10. The distribution of first digits is simply the histogram one gets by discretizing the mantissa distribution. (For base two, the first digit histogram is fatuous, but the mantissa distribution tells the same tale as in all other bases.) More precisely, Benford's law says that the distribution of mantissas is proportional to 1/x←x for 1≤x<10. One can even infer the normalisation this requires: the natural logarithm ln gives the integral of 1/x←x between 1 and the input to ln; so the normalised form of our distribution is 1/ln(10)/x←x. Consequently, the distribution of first digits is ln((i+1)/i)/ln(10)←i.
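
A minimal sketch of this decomposition and of the resulting first-digit law (Python used here purely for illustration; the function names are my own):

```python
import math

def mantissa_exponent(x, base=10):
    """Split positive x as mantissa * base**exponent, with 1 <= mantissa < base."""
    e = math.floor(math.log(x, base))
    m = x / base**e
    if m >= base:  # floating-point rounding can misplace exact powers of the base
        m, e = m / base, e + 1
    return m, e

print(mantissa_exponent(0.00273))  # roughly (2.73, -3)

# Benford's first-digit distribution in base ten: ln((i+1)/i)/ln(10) for each i.
benford = {i: math.log((i + 1) / i) / math.log(10) for i in range(1, 10)}
print(benford[1])             # about 0.301: some 30% of first digits are 1
print(sum(benford.values()))  # 1.0 up to rounding: the law is properly normalised
```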

Distributions

When we consider what's going on here theoretically, we can describe the numbers being considered by a source distribution from which we infer a distribution on the mantissas. To see how to perform this reduction, we must first pause to consider what a distribution is.

Our distribution says how often the various numbers arise: if our numbers came from a random process, we'd call it a probability distribution, but even when they come from some other source we can use the formalism of a distribution to describe the relative frequency with which the possible numbers arise. In the simplest case we'd only have finitely many numbers (e.g. if we were to follow the reduction into the range from 1 to 10 with a reduction to just first digits) and could represent the distribution by a simple value for each number: this value is the proportion of our answers which will yield the given number. Adding up these values over several numbers, we get the proportion of answers which will be one-or-another of the several numbers; adding up all the values, we must get 1 (the proportion of answers that are some number or another – i.e. all answers).

However, when the available numbers form a continuum (e.g. the range from 1 to 10) we encounter a problem: there are infinitely many numbers, so the proportion of answers that will be (exactly) 2.73519 is zero, and likewise for any other particular number. However, we can still sensibly ask what proportion of the answers will lie between 2.73 and 2.74, instead of asking for an exact match: indeed, this is a more realistic description of the data available to us. So, instead of a value for each number, we use a function whose integral over any interval (e.g. from 2.73 to 2.74) gives the proportion of answers lying in that interval. Geometrically, if we plot the graph of this function (with number varying left to right and value of the function varying up and down), the integral is the area under the curve between the vertical lines that cut the horizontal axis at the points corresponding to the numbers at the ends of our interval.
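
For instance, taking the 1/ln(10)/x←x density quoted above as the distribution, this kind of integral is easy to compute (a sketch; the interval from 2.73 to 2.74 is just the example from the text):

```python
import math

def proportion(a, b):
    """Integral of 1/(x.ln(10)) from a to b, which is log10(b) - log10(a)."""
    return math.log(b / a) / math.log(10)

print(proportion(2.73, 2.74))  # about 0.0016: small, but not zero
```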

Reducing modulo powers of 10

So, we start with a distribution f: for given numbers a and b, the integral of f from a to b gives the proportion of our answers that lie between a and b. Reduction into the range from 1 to 10 requires us to construct a corresponding distribution g for which: given numbers a and b between 1 and 10, the integral of g from a to b gives the proportion of our answers that lie between a and b, between a/10 and b/10, between 10.a and 10.b, between a/100 and b/100, or … etc. This gives g as a sum of terms, each inferred from f on the range from one power of 10 to the next. Considering any one of those terms, from 10^i to 10^(i+1), we have to transform the corresponding slice of f's graph onto the interval from 1 to 10 in order to obtain the term's contributions to g. Clearly the transformation involves scaling our numbers by a factor 1/10^i so that 10^i ends up at 1 and 10^(i+1) ends up at 10. However, we must also preserve the area under the graph: which requires us to scale the value of f by the factor 10^i (i.e. the height must be enlarged to compensate for the reduction in width). We are thus led to

g(x) = sum over all integers i of 10^i.f(10^i.x), for 1 ≤ x < 10.

One simple property of g follows trivially from this: considering the limit as x tends to 1 from above or 10 from below, we find (using a substitution j = 1+i) that

g(10) = sum over i of 10^i.f(10^(i+1)) = (1/10).sum over j of 10^j.f(10^j) = g(1)/10

so we know that g(10) is g(1)/10 – i.e. g is smaller at its high-number end than at its low-number end, by a factor of 10.
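
A numeric sketch of this construction (the log-normal density is an arbitrary stand-in of mine for f) confirms both the folding and the end-point relation:

```python
import math

def f(x):
    """Log-normal density (mean-log 2, sigma-log 2): a stand-in source distribution."""
    if x <= 0:
        return 0.0
    return math.exp(-(math.log(x) - 2.0)**2 / 8.0) / (x * math.sqrt(8.0 * math.pi))

def g(x, lo=-40, hi=40):
    """Fold f's strips between successive powers of 10 onto the strip from 1 to 10."""
    return sum(10**i * f(10**i * x) for i in range(lo, hi))

print(g(10.0), g(1.0) / 10)  # equal, up to the (negligible) truncated tails
```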

The construction of g takes the form: cut f up into strips, transform these onto one of them (the strip between 1 and 10) and add up the fragments. The transformation shrinks and expands each strip so as to preserve its area, so the wider strips get expanded vertically to compensate for the necessary horizontal shrinkage. Consequently, the large number strips (which are the wider ones) are emphasised at the expense of the small number strips. As a result, the right-hand tail of the distribution f is most prominently expressed in g.

Furthermore, the n-th derivative of 10^i.f(10^i.x) ←x is 10^(i.(1+n)) times f's n-th derivative at 10^i.x: so the large number strips of f contribute even more forcefully to the derivatives of g than to its value, especially for the higher derivatives.
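
A quick symbolic check of that derivative identity, assuming sympy is available and substituting a concrete function (my choice) for the general f:

```python
import sympy as sp

x, y = sp.symbols('x y')
i, n = 2, 3             # illustrative values; any non-negative integers work
fexpr = sp.exp(-y)      # a concrete stand-in for f

lhs = sp.diff(10**i * fexpr.subs(y, 10**i * x), x, n)
rhs = 10**(i * (1 + n)) * sp.diff(fexpr, y, n).subs(y, 10**i * x)
print(sp.simplify(lhs - rhs))  # 0: the two sides agree
```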

Thus, when f has a broad peak and tails off reasonably slowly (at least to begin with) we can expect g to look roughly like the curve 1/x ←x: we know its end-points lie on a curve of this form, and f's tail is the dominant contributor to the shape of g.

I first encountered Benford's law in a talk by Tom Körner in October 1983; I responded to it in January 1985 with an article in Eureka 45, in which I appear to have proved that f's tail will exhibit alternately +ve and −ve derivatives at its extremities, thereby tending to cause g's derivatives to follow a similar pattern – just like 1/x ←x. However, I have one of the misprinted copies of Eureka 45, in which the relevant pages are blank; and I can't remember my reasoning.

One can reasonably anticipate the alternating signs of derivatives by noting that a distribution, such as f, is necessarily non-negative and normalizable. The latter means its integral from any point out to infinity is finite. Taken with non-negativity, this implies that the distribution must tend to zero from above as it approaches infinity; and gives reasonable grounds to expect that there is a point beyond which its derivative is everywhere negative. (One can construct pathological distributions which defy this expectation, and the conclusions I draw from it, but we have no reason to expect to see them in real data-sets.) In such a case, tending to zero from above with negative derivative, the derivative must also tend to zero and we can reasonably expect there to be a point beyond which the second derivative is everywhere positive. This pattern continues, implying alternating signs for successive derivatives of the source distribution's right tail, which dominates the mantissa distribution and, particularly, its derivatives.
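
For a concrete (and entirely non-pathological) instance of the pattern, the exponential density exp(−x)←x has n-th derivative (−1)^n.exp(−x): each derivative keeps one sign over the whole tail, and the signs alternate. A tiny symbolic check, assuming sympy:

```python
import sympy as sp

x = sp.Symbol('x', positive=True)
for n in range(1, 5):
    print(n, sp.diff(sp.exp(-x), x, n))  # -exp(-x), exp(-x), -exp(-x), exp(-x)
```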

Now, any x^−i ←x with i positive exhibits the same pattern of alternating signs; however, the end-point condition g(10) = g(1)/10 precludes any i other than 1. The above argument is, however, rather vague – so it is fortunate that there is a far clearer way to arrive at the same conclusion.

Simpler proof in the logarithmic domain

The problem with the transformation above, extracting the mantissa distribution from the source distribution, is that its transformation has different effects on different sections of the source distribution; this makes it hard to intuit anything useful about what distribution to expect in the end. However, the fact that we're reducing values modulo powers of 10 is a strong hint that we should in fact be looking at the distribution of the logarithms (to base 10) of our data, rather than that of our data itself. In this logarithmic domain, the mantissa distribution becomes the distribution of the fractional part of the logarithm; ignoring exponents amounts to deeming the logarithms of two values equivalent if they differ by a whole number (or by a whole number multiplied by log(10) if we used some other log than base 10), so the reduction we need to perform amounts to simply cutting the distribution at each whole number, translating each of the resulting strips onto the standard unit interval and summing them. Since this transformation doesn't do any scaling to any of the strips, we can more readily intuit what shape to expect the result to be: if the source distribution of logarithms of values is broad enough to span at least a few strips, and roughly symmetric in shape, we can reasonably expect the reduced distribution to be roughly uniform on the unit interval.
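
A sketch of this view (the log-normal sample is an illustrative choice of mine): with data spread over several decades, a histogram of the fractional parts of the base-ten logarithms comes out close to uniform.

```python
import math
import random

data = [random.lognormvariate(2.0, 3.0) for _ in range(100_000)]
bins = [0] * 10
for x in data:
    t = math.log10(x) % 1.0         # fractional part of the logarithm
    bins[min(int(t * 10), 9)] += 1  # min() guards a rare rounding edge-case
print(bins)  # ten roughly equal counts: near-uniform on the unit interval
```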

Having uniform distribution for the fractional part of the logarithm of our data would mean, for any 0≤a<b≤1, that the proportion of our data that have logarithm with fractional part between a and b would simply be b−a. This, in turn, implies that the proportion of our data having mantissa between 10^a and 10^b is b−a; so this is the integral of g (our mantissa distribution) from 10^a to 10^b. Writing x for 10^a and h for b−a, we have h as the integral of g from x to 10^h.x. For small enough h, g(x) is thus well approximated by h/x/(10^h −1). Use exp for the function specified to equal its own derivative, with exp(0) = 1, and ln for its inverse, which makes ln(10) equal the integral from 1 to 10 of 1/t←t. We can then write 10^h as exp(h.ln(10)) and 10^h −1 is, for small enough h, well approximated (since h.ln(10) is close to 0 and exp'(0) = exp(0) = 1) by h.ln(10). We can thus infer that g(x) is well approximated by 1/x/ln(10). Since this doesn't depend on h, and the approximation gets arbitrarily good as h tends to zero, we can infer that this is indeed the mantissa distribution that would result from a uniform distribution on the fractional part of the logarithm of our data.
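
The limiting step is easy to check numerically (a throwaway sketch):

```python
import math

for h in (0.1, 0.01, 0.001, 0.0001):
    print(h, h / (10**h - 1))  # x.g(x), in effect, for shrinking h
print(1 / math.log(10))        # about 0.4343: the limit, 1/ln(10)
```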

In general, if we have a distribution g on our source data, which can be written as a function X of some alternate parameter, then the distribution, G, on the alternate parameter satisfies G(y) = X'(y).g(X(y)). In our case, the alternate parameter is the logarithm (to base ten) of the source data and X is 10^x ←x; so G(y) = 10^y.ln(10).g(10^y); and g(x) = G(log(x))/x/ln(10), writing log for the base-ten logarithm. So if G is uniform on the unit interval, g(x) is simply proportional to 1/x on the range from 1 to 10. Deviations from uniformity of G thus map onto deviations from 1/x/ln(10)←x for g. Since, in practice, logarithm tables were used for data drawn from very broad classes of sample sources, the domain-specific deviations may reasonably be expected to have averaged out, making the composite G more uniform than its components and the composite g correspondingly closer to Benford's law.
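
A sampling sketch of this change of variables (sample size and test points are arbitrary choices of mine): draw y uniformly from the unit interval, set x = 10^y, and the empirical density of x matches 1/x/ln(10).

```python
import math
import random

samples = [10**random.random() for _ in range(200_000)]

def density_near(xs, a, width=0.1):
    """Fraction of samples in [a, a + width), divided by the width."""
    return sum(a <= x < a + width for x in xs) / len(xs) / width

for a in (1.0, 2.0, 5.0, 9.0):
    print(a, density_near(samples, a), 1 / (a * math.log(10)))
```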

Thus the intuitively obvious distribution in the logarithmic domain (where our reduction is clean and simple) inescapably implies Benford's law for the raw data.

One interesting corollary of this is that we can expect that exponential tables (typically the other half of the book containing the logarithm tables) were uniformly thumbed.

Noise and confusion

There has been a lot of noise and confusion about Benford's law. My first brush with it arose in a public lecture in which the fact that it involves a 1/x←x distribution was portrayed as problematic – since a distribution of this form cannot be extended down to 0 or up to infinity, as its integrals from 0 to 1 and from 10 to infinity are unbounded, making it impossible to normalize the distribution. However, by construction, it's only a distribution on mantissas, which cannot exist outside the range from 1 to 10, so this is a non-issue.

Another (not entirely unrelated) form of confusion, which I've seen in popular science publications, is to interpret it as a scale invariance phenomenon, as if it somehow implied that real-world data sets have the same form at all scales. This is a plain mis-reading of the result, as should be evident from the above – the mantissa distribution (of sufficiently well spread data) doesn't change form when you apply some arbitrary scaling to the raw data: but this is true even when the raw data does indeed have a well-defined scale (as long as its spread is adequately large compared to its mean). We can safely expect to see Benford's law apply to the mantissas of heights of mountains and hills on Earth, but this doesn't change the fact that the raw data's distribution has a definite cut-off at about the ten thousand metre scale.

I have also seen Benford's law described as Benford's theorem (which is silly, because those describing it as such do not even pretend to know of a proof for it; indeed, they generally describe it as surprising or paradoxical): it would be better described as Benford's observation: there are broad classes of data sets for which the mantissa distribution follows a 1/x←x law.

There is nothing magical about Benford's law: it is a simple corollary of drawing data from distributions which are sufficiently broad.


Written by Eddy.