Cox, R. (1946)
Probability, frequency and reasonable expectation. Am. J. Physics 14: 1--13.

This document originated as an attempt at translating, into HTML 3, David MacKay's write-up of the ninth lecture, Bayesian Inference, in a course he taught in Information Theory. That was some time in '95, I think, and I've since been playing around at length with plaintext notations and lost enthusiasm for the available options for displaying maths in markup. So, late in '98, I've decided to transform the notation from David's reasonably orthodox LaTeX into plaintext HTML.

Copyright properly belongs to David; blame the markup and notation on Eddy.

Bayesian Inference

Fortunately, it is not a controversial statement that Bayes' theorem provides the correct language for describing communication over a noisy channel. But let's take a little tour of other applications of probabilistic inference.

Coherent inference can be mapped onto probabilities. Many textbooks on statistics do not mention this fact, so maybe it is worth using an example to emphasize the contrast between Bayesian inference and the orthodox methods of statistical inference involving estimators, confidence intervals, hypothesis testing, etc.

A first example of probability theory

When I was an undergraduate in Cambridge, I was privileged to receive supervisions from Steve Gull. Sitting at his desk in a dishevelled office in St. John's College, I asked him how one ought to answer an old Tripos question:

	Unstable particles are emitted from a source and decay at a distance x
	that has an exponential probability distribution with characteristic
	length λ.  Decay events can only be observed if they
	occur in a window extending from x = 1cm to x = 20cm. N decays are
	observed at locations [x(N), ..., x(1)].  What is λ?

I had scratched my head over this for some time. It was easy to invent an estimator h for λ, given by h = mean(x) - 1 (the sample mean, less the 1cm at which the window opens), where mean(x) = sum(x)/N, that was appropriate for λ small compared to 20cm; with a little ingenuity and the introduction of ad hoc bins, promising estimators for λ large by comparison with 20cm could be constructed. But there was no obvious estimator that would work under all conditions.

Please stop and think about this problem for a moment.

Steve wrote:

P(x given λ)
= exp(-x/λ) / λ / Z(λ)

for 1 < x < 20

= 0

otherwise, with

Z(λ)
= integral(1 to 20: x-> exp(-x/λ) / λ :)
= exp(-1/λ) - exp(-20/λ)

This seemed obvious enough. Then he wrote:

P(λ given [x(N), ..., x(1)]) =
P([x(N),...,x(1)] given λ) . P(λ) / P([x(N),...,x(1)])

which is proportional to

power(-N, λ.Z(λ)).exp(-sum(x)/λ) . P(λ).

Suddenly, the straightforward distribution P( [x(N),...,x(1)] given λ ), defining the probability of the data given the hypothesis λ, was being turned on its head so as to define the probability of a hypothesis given the data. A simple figure showed the probability of a single data point, P(x given λ), as a familiar function of x, for different values of λ. Each curve was an innocent exponential, normalized to have area 1. When the same probability was plotted as a function of λ for a fixed value of x, something remarkable happened: a peak emerged.

The probability density P(x given λ) as a function of x.

The probability density P(x given λ) as a function of λ.

When plotted this way round, the function is known as the likelihood.

Steve summarised Bayes' theorem as embodying the fact that what you know about λ after the data arrive is what you knew before [P(λ)], and what the data told you [P([x(N),...,x(1)] given λ)]. Probabilities are used here to quantify degrees of belief.
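To make Steve's recipe concrete, here is a minimal sketch that evaluates the posterior for λ on a grid, using the expressions for Z(λ) and the likelihood given above. The data values, the grid and the flat prior are all assumptions of mine, chosen purely for illustration; they are not part of the original problem.

```python
import math

X_MIN, X_MAX = 1.0, 20.0  # observation window in cm, from the problem statement

def Z(lam):
    """Normalizing constant: probability that a decay lands inside the window."""
    return math.exp(-X_MIN / lam) - math.exp(-X_MAX / lam)

def log_likelihood(lam, xs):
    """log P(xs given lam) = -N.log(lam.Z(lam)) - sum(xs)/lam."""
    return -len(xs) * math.log(lam * Z(lam)) - sum(xs) / lam

def posterior_on_grid(xs, lams, prior):
    """Normalized posterior weights over the grid points lams."""
    logs = [log_likelihood(lam, xs) + math.log(prior(lam)) for lam in lams]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data (cm) and grid, chosen only for illustration.
xs = [1.5, 2.0, 3.0, 4.0, 5.0, 12.0]
lams = [0.1 * k for k in range(1, 1001)]  # 0.1 cm .. 100 cm
post = posterior_on_grid(xs, lams, prior=lambda lam: 1.0)  # flat prior: an assumption
lam_map = lams[post.index(max(post))]
print(f"most probable lambda on the grid: {lam_map:.1f} cm")
```

Swapping in a different prior function is a quick way to see how much, or how little, the answer depends on it.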

To nip possible confusion in the bud, it must be emphasized that the hypothesis λ which correctly describes the situation is not a stochastic variable, and the fact that the Bayesian uses a probability distribution P does not mean that he thinks of the world as stochastically changing its nature between the states described by the different hypotheses. He uses the notation of probabilities to represent his beliefs about the mutually exclusive micro-hypotheses, of which only one is actually true. That probabilities can denote degrees of belief, given assumptions, seemed intuitive to me, and has been proved.

The posterior probability distribution of Bayes' Theorem represents the unique and complete solution to the problem. There is no need to invent estimators; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty-seven ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem.

Our inference is conditional on our assumptions [for example, the prior P(λ)]. Critics view such priors as a difficulty because they are subjective, but I don't see how it could be otherwise. How can one perform inference without making assumptions? I believe that it is of great value that Bayesian methods force one to make these tacit assumptions explicit. First, once assumptions are made, the inferences are objective and unique, reproducible with complete agreement by anyone who has the same information and makes the same assumptions. For example, given assumptions H (e.g. those listed above) and data D (e.g. those from an experiment measuring decay lengths), everyone will agree about the posterior probability of the decay length λ:

P(λ given D,H)
= P(D given λ,H) . P(λ given H) / P(D given H)

Second, when the assumptions are explicit, they are easier to criticize, and we can quantify the sensitivity of our inferences to the details of the assumptions. We can note from the likelihood curves that in the case of a single data point at x=5, the likelihood function is less strongly peaked than in the case x=3; the details of the prior P(λ) become more important as mean(x) approaches 10.5, the middle of the window. In the case x=12, the likelihood function doesn't have a peak at all. Such data merely rule out small values of λ, and don't give any information about the relative probabilities of large values of λ. So in this case, the details of the prior at the small λ end of things are not important, but at the large λ end, our prior is important.
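The behaviour of the single-point likelihood can be checked numerically. The sketch below evaluates P(x given λ) on a grid of λ values and reports where it is largest for x = 3, 5 and 12. The grid itself is an arbitrary choice of mine. For x = 12 the maximum should land on the upper edge of whatever grid is used, which is the numerical signature of a likelihood with no peak.

```python
import math

def likelihood(lam, x, x_min=1.0, x_max=20.0):
    """P(x given lam) for one observed decay at x inside the window."""
    z = math.exp(-x_min / lam) - math.exp(-x_max / lam)
    return math.exp(-x / lam) / (lam * z)

def peak_location(x, lams):
    """Grid value of lam at which the likelihood of x is largest."""
    return max(lams, key=lambda lam: likelihood(lam, x))

lams = [0.1 * k for k in range(1, 2001)]  # 0.1 cm .. 200 cm (arbitrary grid)
for x in (3.0, 5.0, 12.0):
    where = peak_location(x, lams)
    print(f"x = {x:4.1f} cm: likelihood is largest at lambda = {where:.1f} cm")
```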

Third, when we are not sure which of various alternative assumptions is the most appropriate for a problem, we can treat this question as another inference task. Thus, given data D and some unquestioned assumptions, I, we can compare alternative assumptions H using Bayes' theorem:

P(H given D,I)
= P(D given H,I) . P(H given I) / P(D given I).

Fourth, we can take into account our uncertainty regarding such assumptions when we make subsequent predictions. Rather than choosing one particular assumption H*, and working out our predictions about some quantity X, P(X given D,H*,I), we obtain predictions that take into account our uncertainty about H by using the sum rule: P(X given D, I) = sum(: H-> P(X given H,D,I) . P(H given D,I) :). (This is another contrast with orthodox statistics, in which it is conventional to test a default model, and then, if the test accepts the model, to use that model exclusively to make predictions.)
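The third and fourth points can be sketched together in a few lines: first compare the alternative assumptions using Bayes' theorem, then average the predictions over them with the sum rule. The hypothesis names and every number below are invented for illustration only.

```python
def posterior_over_hypotheses(prior, evidence):
    """P(H given D,I) from P(H given I) and P(D given H,I), via Bayes' theorem."""
    joint = {h: prior[h] * evidence[h] for h in prior}
    total = sum(joint.values())  # this is P(D given I)
    return {h: joint[h] / total for h in joint}

def averaged_prediction(posterior, prediction):
    """Sum rule: P(X given D,I) = sum over H of P(X given H,D,I) . P(H given D,I)."""
    return sum(prediction[h] * posterior[h] for h in posterior)

# All numbers below are invented for illustration only.
prior = {"H1": 0.5, "H2": 0.5}        # P(H given I)
evidence = {"H1": 0.02, "H2": 0.06}   # P(D given H,I)
prediction = {"H1": 0.9, "H2": 0.3}   # P(X given H,D,I)

post = posterior_over_hypotheses(prior, evidence)
p_x = averaged_prediction(post, prediction)
print(post, p_x)
```

With these made-up numbers, committing to the more probable hypothesis H2 alone would predict 0.3, whereas the averaged prediction also carries the weight still attached to H1.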

Steve thus persuaded me that

Probability theory reaches parts that ad hoc methods cannot reach.

Let's look at a few more examples of simple inference problems. The following example illustrates that there is more to Bayesianism than the priors.

An example of legal evidence

Two people have left traces of their own blood at the scene of a crime. Their blood groups can be reliably identified from these traces and are found to be of type O (a common type in the local population, having frequency p(O) = 60%) and of type AB (a rare type, with frequency p(AB) = 1%). A suspect is tested and found to have type O blood. Do these data D = (type O and AB blood were found at the scene) make it more probable that this suspect was one of the two people present at the crime? A careless lawyer might claim that the fact that the suspect's blood type was found at the scene is positive evidence for the theory that he was present.

Denote the proposition "the suspect and one unknown person were present" by S. The alternative, unS, states that "two unknown people from the population were present". The prior in this problem is the prior probability ratio between the propositions S and unS. This quantity is important to the final verdict and would be based on all other available information in the case. Our task here is just to evaluate the contribution made by the data D, that is, the likelihood ratio, P(D given S,H)/P(D given unS,H). In general, a jury's task should be to multiply together carefully evaluated likelihood ratios from each independent piece of admissible evidence.

The probability of the data given S is the probability that one unknown person drawn from the population has blood type AB:

P(D given S,H)
= p(AB)

The probability of the data given unS is the probability that two unknown people drawn from the population have types O and AB, in either order:

P(D given unS,H)
= 2 . p(O) . p(AB)

In these equations H denotes the assumptions that two people were present and left blood there, and that the probability distribution of the blood groups of unknown people in an explanation is the same as the population frequencies, p(O) and p(AB).

Dividing, we obtain the likelihood ratio:

P(D given S,H) / P(D given unS,H)
= 1 / (2 . p(O))
= 0.83

Thus the data in fact provide weak evidence against the supposition that this suspect was present.
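For anyone who distrusts the algebra, the two probabilities can be obtained by brute-force enumeration over the blood types of the unknown people. This is only a sketch: the function name is mine, and the "other" category is an assumption added so that the frequencies sum to 1.

```python
from itertools import product

def prob_of_data(freq, known, n_people=2, data=("AB", "O")):
    """P(data given who was present). `known` lists the blood types of the
    identified people; the rest are unknowns drawn independently from `freq`."""
    total = 0.0
    n_unknown = n_people - len(known)
    for unknowns in product(freq, repeat=n_unknown):
        if sorted(known + unknowns) == sorted(data):
            p = 1.0
            for blood_type in unknowns:
                p *= freq[blood_type]
            total += p
    return total

# Frequencies from the text; "other" lumps together everything else
# (an assumption, made so that the frequencies sum to 1).
freq = {"O": 0.60, "AB": 0.01, "other": 0.39}

p_S = prob_of_data(freq, known=("O",))  # the type O suspect plus one unknown
p_unS = prob_of_data(freq, known=())    # two unknowns
print(f"P(D given S) = {p_S:.4f}, P(D given unS) = {p_unS:.4f}, ratio = {p_S / p_unS:.3f}")
```

Calling `prob_of_data(freq, known=("AB",))` instead gives the corresponding probability for a suspect of type AB.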

This result may be found surprising, so let us examine it from various points of view. First consider the case of another suspect who has type AB. Intuitively, the data do provide evidence in favour of the theory S' that this suspect was present, relative to the null hypothesis unS. And indeed the likelihood ratio in this case is:

P(D given S',H) / P(D given unS,H)
= 1 / (2 . p(AB))
= 50

Now let us change the situation slightly; imagine that 99% of people are of blood type O, and the rest are of type AB. The data at the scene are the same as before. Consider again how these data influence our beliefs about a particular suspect of type O and another of type AB. Intuitively, we still believe that the presence of the rare AB blood provides positive evidence that the suspect of type AB was there. But do we still have the feeling that the fact that type O blood was detected at the scene favours the hypothesis that the type O suspect was present? If this were the case, that would mean that regardless of who the suspect is, the data make it more probable they were present, which would be absurd. The data may be compatible with any suspect of either blood type being present, but if they provide positive evidence for some theories, they must also provide evidence against other theories.

Here is another way of thinking about this: imagine that instead of two people's blood stains there are ten, and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. Consider a particular type O suspect: without any other information, there is a one in ten chance that he was at the scene. We now get the results of blood tests, and find that nine of the ten stains are of type AB, and one of the stains is of type O. Does this make it more likely that the type O suspect was there? No: although he could have been, there is now only a one in ninety chance that he was, since we know that only one person present was of type O.
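The one in ten and one in ninety figures can be reproduced exactly with a little counting. The sketch below assumes that, before the blood tests, every group of ten people from the hundred is equally likely to have left the stains; that assumption is mine, though it is what makes the prior one in ten.

```python
from fractions import Fraction
from math import comb

# Numbers from the text: 100 people (90 type O, 10 type AB), 10 stains,
# of which 9 are type AB and 1 is type O. The suspect is type O.
n_O, n_AB, n_stains = 90, 10, 10
stains_O, stains_AB = 1, 9

prior = Fraction(n_stains, n_O + n_AB)  # chance he is among the ten, before the tests

# If the suspect was present he accounts for the single type O stain; the other
# nine people are drawn from the remaining 99 (89 type O, 10 type AB).
like_present = Fraction(comb(n_O - 1, stains_O - 1) * comb(n_AB, stains_AB),
                        comb(n_O + n_AB - 1, n_stains - 1))

# If he was absent, all ten people are drawn from the remaining 99.
like_absent = Fraction(comb(n_O - 1, stains_O) * comb(n_AB, stains_AB),
                       comb(n_O + n_AB - 1, n_stains))

posterior = (prior * like_present) / (prior * like_present + (1 - prior) * like_absent)
print(prior, posterior)  # prints "1/10 1/90"
```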

Maybe the intuition is aided finally by writing down the formulae for the general case where n(O) blood stains of individuals of type O are found, and n(AB) of type AB, a total of N individuals in all, and unknown people come from a large population with fractions p(O), p(AB). The task is to evaluate the likelihood ratio for the two hypotheses S, "the type O suspect and N-1 unknown others left N stains", and unS, "N unknowns left N stains". The probability of the data under hypothesis unS is just the probability of getting n(O), n(AB) individuals of the two types when N individuals are drawn at random from the population:

P(n(O),n(AB) given unS)
= power(n(O), p(O)) . power(n(AB), p(AB)) . N! / n(O)! / n(AB)!

In the case of hypothesis S, we need to predict the distribution of the N-1 other individuals:

P(n(O),n(AB) given S)
= power(n(O)-1, p(O)) . power(n(AB), p(AB)) . (N-1)! / (n(O)-1)! / n(AB)!

The likelihood ratio is:

P(n(O), n(AB) given S) / P(n(O), n(AB) given unS)
= n(O) / N / p(O).

This is a very instructive result. The likelihood ratio, i.e. the contribution of these data to the question of whether the type O suspect was present, depends simply on a comparison of the frequency of type O blood in the observed data, n(O) / N, with the background frequency of type O blood in the population, p(O). There is no dependence on the counts of the other types found at the scene, or their frequencies in the population. If there are more type O stains than the average expected by chance (hypothesis unS), then the data give evidence in favour of the presence of this type O suspect. Conversely, if there are fewer type O stains than the expected number under unS, then the data reduce the probability of the hypothesis that he was there. In the special case n(O)/N = p(O), the data contribute no evidence either way, regardless of the fact that the data are compatible with the hypothesis S.
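As a check on the general result, the sketch below computes the two probabilities directly from the formulae above and compares their ratio with n(O) / N / p(O). The function names are mine. The second call uses made-up counts and frequencies, picked so that n(O)/N = p(O).

```python
from math import factorial

def multinomial_two(n_O, n_AB, p_O, p_AB):
    """Probability of n_O type O and n_AB type AB among n_O + n_AB random unknowns."""
    n = n_O + n_AB
    return (p_O ** n_O) * (p_AB ** n_AB) * factorial(n) / (factorial(n_O) * factorial(n_AB))

def likelihood_ratio(n_O, n_AB, p_O, p_AB):
    """P(data given S) / P(data given unS) for a type O suspect (needs n_O >= 1)."""
    p_S = multinomial_two(n_O - 1, n_AB, p_O, p_AB)  # the suspect supplies one O stain
    p_unS = multinomial_two(n_O, n_AB, p_O, p_AB)
    return p_S / p_unS

# The two-stain case from the text: one O stain, one AB stain.
ratio = likelihood_ratio(n_O=1, n_AB=1, p_O=0.60, p_AB=0.01)
closed_form = 1 / 2 / 0.60  # n(O) / N / p(O)
print(ratio, closed_form)

# Made-up case with n(O)/N = p(O): the data should be neutral (ratio of 1).
neutral = likelihood_ratio(n_O=3, n_AB=7, p_O=0.30, p_AB=0.10)
print(neutral)
```

Changing p_AB in either call leaves the ratio untouched, which is the independence from the other types noted above.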

Maintained by Eddy.