Dual to the notion of information content is the question: how random is it ?

For discrete probability distributions, I have a notion for that, whose form
may be extended to non-discrete situations. It goes like this:

Consider a finite collection, of size M, whose members I'll call buckets;
and a finite collection, of size N, whose members I'll call blobs. If I throw
each blob into a random bucket, ignoring such matters as where the buckets are
and how many blobs are already in them, I can examine the probability of any
given distribution of blobs arising among the buckets. This lets us infer a
probability distribution on distributions of blobs among buckets. If we
characterize this probability distribution, we can discover what
characteristics the most probable distributions have.

All choices being independent, when I examine a randomly chosen bucket
after such distribution, I can ask what the probability is of finding 0, 1, 2
or … blobs in the chosen bucket. The probability of finding i blobs in
some bucket is formed as a product of three terms: each of the i blobs in our
bucket had a probability of 1/M of landing in that bucket, contributing
pow(1/M, i); each of the N−i blobs *not* in our bucket had a
probability of (M−1)/M of not landing in that bucket, contributing
pow((M−1)/M, N−i); and there are chose(N,i) = N!/i!/(N−i)!
ways of choosing which i of the available N were to be the ones
which *did* land in our given bucket.
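The three-term product above can be sketched in Python (the helper name `p_blobs` is my own, not from the text):

```python
from math import comb

# Probability of finding exactly i blobs in a given bucket, when each of
# N blobs independently lands in one of M buckets with probability 1/M:
# chose(N, i) . pow(1/M, i) . pow((M-1)/M, N-i).
def p_blobs(i, N, M):
    return comb(N, i) * (1 / M) ** i * ((M - 1) / M) ** (N - i)

# Sanity check: the probabilities over all possible counts i = 0..N sum to 1.
N, M = 20, 5
total = sum(p_blobs(i, N, M) for i in range(N + 1))
```

This is just the binomial distribution with success probability 1/M, as the derivation in the text spells out.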

We obtain a random distribution over our buckets, which we can view as a probability distribution scaled by N, characterized by a mapping from buckets to how many blobs landed in each, ({naturals}: n |M). We have sum(n) = N, and the probability of any given n arising is proportional to the number of ways of partitioning our N blobs over M for which the distribution is our given n, times pow(1/M, N) for the probability of each blob landing in the slot in which it did land. The number of partitions involved is simply N!/product(: n(i)! ←i |M), so the probability of seeing distribution n is N! / pow(M, N) / product(n!), in which N = sum(n).
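The formula N! / pow(M, N) / product(n!) can likewise be checked directly; the function name `p_distribution` and the small N = 4, M = 3 example are mine, chosen only so the exhaustive sum is cheap:

```python
from math import factorial, prod
from itertools import product as cartesian

# Probability of seeing a particular distribution n (a tuple giving how many
# blobs landed in each of the M buckets): N! / pow(M, N) / product(n!),
# with N = sum(n).
def p_distribution(n):
    N, M = sum(n), len(n)
    return factorial(N) / M ** N / prod(factorial(k) for k in n)

# Sanity check: summing over every way of distributing N = 4 blobs among
# M = 3 buckets gives total probability 1.
N, M = 4, 3
total = sum(p_distribution(n)
            for n in cartesian(range(N + 1), repeat=M)
            if sum(n) == N)
```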

Let p = (: n(i)/N ←i |M), so sum(p) = 1, and use the leading terms in Stirling's formula, log(N!) ≈ (N+1/2).log(N) −N, to obtain

- log(N!/product(n!))
- ≈ log(N).(N+1/2) −N −sum(: log(n(i)).(n(i)+1/2) −n(i) ←i |M)
- = log(N).(N +1/2) −sum(log(n).(n+1/2)), the −N cancelling against the −n(i) terms, whose sum is −N

albeit with some complications needed to deal with those i for which
n(i) isn't large enough to apply Stirling's formula; but we can eliminate
these by ignoring the i for which n(i) is zero (in product(n!), these don't
contribute) and increasing M enough to ensure that all but negligibly few of
the other n(i) *are* large enough. This also ensures that we can
ignore the +1/2 terms, leaving us with N.log(N) −sum(n.log(n)). We know
log(pow(M,N)) = N.log(M), so the probability of seeing n has, as its
log,

- N.log(N/M) −sum(n.log(n))
- = sum(n).log(N/M) −sum(n.log(n))
- = −sum(n.log(n/N)) −N.log(M)
- = N.sum(p.log(1/p)) −N.log(M)
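A quick numerical sketch of this approximation, comparing the exact log-probability (via lgamma, since lgamma(k+1) = log(k!)) against N.sum(p.log(1/p)) −N.log(M); the helper names and the sample n = (300, 200, 500) are my own:

```python
from math import lgamma, log

# Exact log-probability of distribution n over M = len(n) buckets:
# log(N!) - N.log(M) - sum(log(n(i)!)).
def log_p(n):
    N, M = sum(n), len(n)
    return lgamma(N + 1) - N * log(M) - sum(lgamma(k + 1) for k in n)

# Stirling-based approximation: N.sum(p.log(1/p)) - N.log(M),
# i.e. sum(n.log(N/n)) - N.log(M), skipping any zero n(i).
def approx_log_p(n):
    N, M = sum(n), len(n)
    return sum(k * log(N / k) for k in n if k) - N * log(M)

# With every n(i) large, the two agree up to the dropped +1/2 terms,
# which grow only logarithmically while the total grows like N.
n = (300, 200, 500)          # N = 1000 blobs over M = 3 buckets
exact, approx = log_p(n), approx_log_p(n)
```

The discrepancy here is the (1/2).log terms the text says we may ignore: a handful of units against a log-probability of order −76.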

Thus the probability of seeing distribution p depends on p only via sum(p.log(1/p)). A sufficiently well-behaved integration (or procedure for infinite summation) over some domain can reduce probability measures on that domain to functions like p, for which we have a measure of probability (relative to the integration, in fact) expressed by integral(p.log(1/p)). The original probability measure is p@integral, and we can equally construct (p.log(1/p))@integral, whose total measure is thus our indicator of the probability of seeing distribution p. In any case, for any distribution p, define I(p) = sum(p.log(1/p)), with sum replaced by a suitable measure where appropriate.
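In the discrete case, I(p) is a one-liner (taking 0.log(1/0) = 0 by continuity); the two example distributions are mine:

```python
from math import log

# I(p) = sum(p.log(1/p)), with the convention that terms with p(i) = 0
# contribute nothing (their limit is 0).
def I(p):
    return sum(x * log(1 / x) for x in p if x > 0)

uniform = (0.25, 0.25, 0.25, 0.25)   # I = log(4), the maximum on 4 buckets
skewed = (0.7, 0.1, 0.1, 0.1)        # concentrated, so smaller I
```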

What kinds of distribution are most probable ? We can use Lagrange's method of multipliers to solve the problem. Let p encode the system's state, so D is the collection of distributions on M; V and W are {scalars}; f is (V: sum(p.log(1/p)) ←p |D) and g is sum, with constraint value g(p) = 1. We have h linear, (V: h |W), with W = {scalars} = V, so h is just a scalar. Differentiation with respect to p(i), for some i in M, takes f to −1−log(p(i)) and g to 1, so ∂f = h·∂g becomes 0 = 1 +log(p(i)) +h, which makes p(i) = exp(−1−h), so p is constant. This is exactly the uniform distribution we used when throwing each blob into a random bucket, so this shouldn't be surprising. Note that h.g−f, with derivative h+1+log(p), has diagonal second derivative, with entries 1/p all positive: so h.g−f is at a minimum and f−h.g is at a maximum: we've maximized f with respect to the constraint (the method of multipliers could as readily have given us a minimum, in general).
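A sketch of this conclusion in code: the multiplier solution h = log(M) −1 does give p(i) = exp(−1−h) = 1/M, and no randomly chosen distribution beats the uniform one on I (the helper names and the choice M = 5 are mine):

```python
from math import exp, log
from random import random, seed

# I(p) = sum(p.log(1/p)), skipping zero entries.
def I(p):
    return sum(x * log(1 / x) for x in p if x > 0)

M = 5
uniform = [1 / M] * M

# The stationarity condition 0 = 1 + log(p(i)) + h forces every p(i) to the
# same value exp(-1-h); the constraint sum(p) = 1 then fixes h = log(M) - 1.
h = log(M) - 1
assert abs(exp(-1 - h) - 1 / M) < 1e-12

# Empirical check that this is a maximum: no renormalised random
# distribution on M buckets attains a larger I.
seed(0)
for _ in range(1000):
    ws = [random() for _ in range(M)]
    total = sum(ws)
    p = [w / total for w in ws]
    assert I(p) <= I(uniform) + 1e-12
```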

When we have more constraints, W has higher dimension, h is more interesting and p can take quite different forms.
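As an illustrative sketch (my own example, not from the text): add one extra constraint fixing the mean, sum(i.p(i)) = mu, over states i in {0, …, M−1}. The multiplier equations then give p(i) proportional to exp(−b.i) for some multiplier b, which we can solve for numerically:

```python
from math import exp

# Maximum-entropy distribution on {0, ..., M-1} subject to a fixed mean:
# the multiplier method gives p(i) proportional to exp(-b.i).
def boltzmann(b, M):
    ws = [exp(-b * i) for i in range(M)]
    total = sum(ws)
    return [w / total for w in ws]

def mean(p):
    return sum(i * x for i, x in enumerate(p))

# mean(boltzmann(b, M)) decreases as b increases, so bisect on b until the
# mean constraint is met.
def solve_b(mu, M, lo=-50.0, hi=50.0):
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean(boltzmann(mid, M)) > mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

M, mu = 10, 2.0              # target mean below the uniform mean (M-1)/2
p = boltzmann(solve_b(mu, M), M)
```

With mu below the uniform mean, b comes out positive and p decays exponentially in i: quite different in form from the unconstrained uniform answer, as the text promises.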

A more light-hearted (yet serious) overview of coding theory: The story of Alice and Bob. Favourite quote:

Nothing else is like information. Information is very peculiar stuff. It can both be created and destroyed. You can steal it without removing it. You can often get some just by guessing.

Yet it can have great value. It can be bought and sold. One type of information is called Money. There are people who refuse to concede that money can be created and destroyed. They spend their entire lives altering records and making adjustments to ensure that every time a bit of money leaves some place, an equal bit seems to appear somewhere else. These people are called accountants.

Written by Eddy.