Information content

The notion dual to information content is how random is it ? For discrete probability distributions, I have a notion for that, whose form may potentially be extended to non-discrete situations. It goes like this:

Consider a finite collection, of size M, whose members I'll call buckets; and a finite collection, of size N, whose members I'll call blobs. If I throw each blob into a random bucket, ignoring such matters as where the buckets are and how many blobs are already in them, I can examine the probability of any given distribution of blobs arising among the buckets. This lets us infer a probability distribution on distributions of blobs among buckets. If we characterize this probability distribution, we can discover what characteristics the most probable distributions have.

All choices being independent, when I examine a randomly chosen bucket after such a distribution, I can ask what the probability is of finding 0, 1, 2 or … blobs in the chosen bucket. The probability of finding i blobs in our chosen bucket is a product of three terms: each of the i blobs in our bucket had a probability of 1/M of landing in that bucket, contributing pow(1/M, i); each of the N−i blobs not in our bucket had a probability of (M−1)/M of not landing in that bucket, contributing pow((M−1)/M, N−i); and there are chose(N,i) = N!/i!/(N−i)! ways of choosing which i of the available N were to be the ones which did land in our given bucket.
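
As a sanity check, here's a minimal sketch in Python (the function names are my own, purely for illustration) comparing this binomial formula with a direct simulation of throwing blobs into buckets:

    from math import comb
    from random import randrange

    def bucket_count_probability(i, N, M):
        # chose(N, i) ways to pick which i blobs land in our bucket, each
        # landing there with probability 1/M; the other N-i blobs each
        # miss it with probability (M-1)/M.
        return comb(N, i) * (1/M)**i * ((M-1)/M)**(N-i)

    def simulate(N, M, trials=100000):
        # Throw N blobs into M buckets, trials times over, and record how
        # often bucket 0 ends up holding exactly i blobs.
        counts = {}
        for _ in range(trials):
            hits = sum(1 for _ in range(N) if randrange(M) == 0)
            counts[hits] = counts.get(hits, 0) + 1
        return {i: c / trials for i, c in sorted(counts.items())}

    # e.g. simulate(20, 5) should roughly match
    # {i: bucket_count_probability(i, 20, 5) for i in range(21)}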

We obtain a random distribution over our buckets, which we can view as a probability distribution scaled by N, characterized by a mapping from buckets to how many blobs landed in each, ({naturals}: n |M). We have sum(n) = N, and the probability of any given n arising is proportional to the number of ways of partitioning our N blobs over M for which the distribution is our given n, times pow(1/M, N) for the probability of each blob landing in the slot in which it did land. The number of partitions involved is simply N!/product(: n(i)! ←i |M), so the probability of seeing distribution n is N! / pow(M, N) / product(n!), in which N = sum(n).
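
The probability of a whole distribution n is easy to compute directly; a minimal sketch (again in Python, with names of my choosing):

    from math import factorial, prod

    def distribution_probability(n, M):
        # n lists how many blobs landed in each of the M buckets; N = sum(n).
        # Probability is N! / product(n(i)!) / pow(M, N).
        assert len(n) == M
        N = sum(n)
        return factorial(N) / prod(factorial(k) for k in n) / M**N

    # e.g. three blobs over two buckets: the split [2, 1] arises in 3 of the
    # 8 equally likely ways, so distribution_probability([2, 1], 2) == 3/8.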

Let p = (: n(i)/N ←i |M), so sum(p) = 1, and use the leading terms in Stirling's formula, log(N!) ≈ (N+1/2).log(N) −N, to obtain

log(N!/product(n!))
≈ log(N).(N+1/2) −N −sum(: log(n(i)).(n(i)+1/2) −n(i) ←i |M)
= log(N).(N+1/2) −sum(log(n).(n+1/2))

albeit with some complications needed to deal with those i for which n(i) isn't large enough to apply Stirling's formula; but we can eliminate these by ignoring the i for which n(i) is zero (in product(n!), these don't contribute) and increasing M enough to ensure that all but negligibly few of the other n(i) are large enough. This also ensures that we can ignore the +1/2 terms, leaving us with N.log(N) −sum(n.log(n)). We know log(pow(M,N)) = N.log(M), so the probability of seeing n has, as its log,
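
To see how good the approximation is, one can compare the exact value of log(N!/product(n!)) (via the log-gamma function) with N.log(N) −sum(n.log(n)); a small sketch, assuming counts large enough for the argument above to apply:

    from math import lgamma, log

    def exact_log_multinomial(n):
        # log(N!/product(n!)), using lgamma(k + 1) = log(k!).
        N = sum(n)
        return lgamma(N + 1) - sum(lgamma(k + 1) for k in n)

    def stirling_log_multinomial(n):
        # N.log(N) - sum(n.log(n)), skipping zero counts (which contribute
        # nothing to product(n!) anyway).
        N = sum(n)
        return N * log(N) - sum(k * log(k) for k in n if k > 0)

    # For n = [1000, 2000, 3000, 4000] the two differ by only about 0.1%.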

N.log(N/M) −sum(n.log(n))
= sum(n).log(N/M) −sum(n.log(n))
= −sum(n.log(n/N)) −N.log(M)
= N.sum(p.log(1/p)) −N.log(M)

Thus the probability of seeing distribution p only depends on p via sum(p.log(1/p)). A sufficiently well-behaved integration (or procedure for infinite summation) over some domain can reduce probability measures on that domain to functions like p, for which we then have a measure of probability (relative to the integration, in fact) expressed by integral(p.log(1/p)). The original probability measure is p@integral and we can equally construct (p.log(1/p))@integral, whose total measure is then our indicator of the probability of seeing distribution p. In any case, for any distribution p, define I(p) = sum(p.log(1/p)), with sum replaced by a suitable integration if appropriate.
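
A quick numeric check (in Python; I(p) here is just the definition above) that the log of the probability of seeing n does indeed depend on n only via I(p):

    from math import lgamma, log

    def I(p):
        # sum(p.log(1/p)), taken over the non-zero entries of p.
        return sum(q * log(1/q) for q in p if q > 0)

    def log_probability(n, M):
        # Exact log of N!/product(n!)/pow(M, N) for counts n over M buckets.
        N = sum(n)
        return lgamma(N + 1) - sum(lgamma(k + 1) for k in n) - N * log(M)

    def approx_log_probability(n, M):
        # N.I(p) - N.log(M), with p = n/N.
        N = sum(n)
        return N * I([k / N for k in n]) - N * log(M)

    # With large counts the two agree closely; and permuting the entries of
    # n (which leaves I(p) unchanged) leaves both unchanged.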

What kinds of distribution are most probable ? We can use Lagrange's method of multipliers to solve the problem. Let p encode the system's state, so D is the collection of distributions on M, V and W are {scalars}, f is (V: sum(p.log(1/p)) ←p |D) and g is sum, with constraint value g(p) = 1. We have h as linear (V: h |W) with W = {scalars} = V, so h is just a scalar. Differentiating with respect to p(i), for some i in M, takes f to −1−log(p(i)) and g to 1, so ∂f = h·∂g becomes 0 = 1+log(p(i))+h, which makes p(i) = exp(−1−h), so p is constant. This is exactly the kind of distribution we used in deciding what was the most random way to distribute the blobs, so this shouldn't be surprising. Note that h.g−f, with derivative h+1+log(p), has diagonal second derivative, with entries 1/p all positive; so h.g−f is at a minimum and f−h.g is at a maximum: we've maximized f subject to the constraint (the method of multipliers could as readily have given us a minimum, in general).
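
One can also simply confirm numerically that the constant distribution beats everything else; a minimal sketch (random perturbations, nothing clever):

    from math import log
    from random import random

    def I(p):
        return sum(q * log(1/q) for q in p if q > 0)

    M = 5
    uniform = [1 / M] * M   # the p(i) = exp(-1-h) solution: all entries equal

    # Any other distribution on M buckets that sums to 1 scores no higher:
    for _ in range(1000):
        w = [random() for _ in range(M)]
        total = sum(w)
        p = [x / total for x in w]
        assert I(p) <= I(uniform) + 1e-12

    # I(uniform) = log(M), the largest value I can take on M outcomes.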

When we have more constraints, W has higher dimension, h is more interesting and p can take quite different forms.
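
For instance, adding a second constraint that fixes a weighted sum sum(p.E), for some given mapping E from M to scalars, makes the multiplier argument give p(i) proportional to exp(−b.E(i)) for some scalar b chosen to meet the constraint. A rough sketch of solving for b numerically (the names and the bisection are mine, purely illustrative):

    from math import exp

    def maximizing_p(E, target, lo=-50.0, hi=50.0):
        # Maximize I(p) subject to sum(p) = 1 and sum(p.E) = target: the
        # answer has p(i) proportional to exp(-b.E(i)); find b by bisection,
        # using the fact that the resulting mean decreases as b grows.
        # Assumes target lies strictly between min(E) and max(E) and that
        # the entries of E are modest in size.
        def mean_for(b):
            w = [exp(-b * e) for e in E]
            return sum(e * x for e, x in zip(E, w)) / sum(w)
        for _ in range(200):
            mid = (lo + hi) / 2
            if mean_for(mid) > target:
                lo = mid   # mean too big: push b upwards
            else:
                hi = mid
        b = (lo + hi) / 2
        w = [exp(-b * e) for e in E]
        Z = sum(w)
        return [x / Z for x in w]

    # e.g. maximizing_p([0, 1, 2, 3], 1.0) is no longer constant: it decays
    # geometrically in E, yet has the largest I(p) of any distribution with
    # that value of sum(p.E).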

See also

A more light-hearted (yet serious) overview of coding theory: The story of Alice and Bob. Favourite quote:

Nothing else is like information. Information is very peculiar stuff. It can both be created and destroyed. You can steal it without removing it. You can often get some just by guessing.

Yet it can have great value. It can be bought and sold. One type of information is called Money. There are people who refuse to concede that money can be created and destroyed. They spend their entire lives altering records and making adjustments to ensure that every time a bit of money leaves some place, an equal bit seems to appear somewhere else. These people are called accountants.


Written by Eddy.