- Cox, R. (1946). Probability, frequency and reasonable expectation. *Am. J. Physics* **14**: 1--13.

This document originated as an attempt to translate David MacKay's write-up of
the ninth lecture, "Bayesian Inference", in a course he taught in
Information Theory, into HTML 3. That was some time in '95, I think; since then I have
played around at length with plaintext notations and lost enthusiasm for the
available options for displaying maths in markup. So, late in '98, I decided
to transform the notation from David's reasonably orthodox LaTeX into plaintext HTML.

Copyright properly belongs to David; blame the markup and notation on Eddy.

Fortunately, it is not a controversial statement that Bayes' theorem provides the correct language for describing communication over a noisy channel. But let's take a little tour of other applications of probabilistic inference.

Coherent inference can be mapped onto
probabilities. Many textbooks on statistics do not mention this fact, so maybe
it is worth using an example to emphasize the contrast between Bayesian
inference and the orthodox methods of statistical inference involving
estimators, confidence intervals, hypothesis testing, *etc*.

When I was an undergraduate in Cambridge, I was privileged to receive supervisions from Steve Gull. Sitting at his desk in a dishevelled office in St. John's College, I asked him how one ought to answer an old Tripos question:

```
Unstable particles are emitted from a source and decay at a distance x
that has an exponential probability distribution with characteristic
length λ. Decay events can only be observed if they
occur in a window extending from x = 1cm to x = 20cm. N decays are
observed at locations [x(N), ..., x(1)]. What is λ?
```

I had scratched my head over this for some time. It was easy to invent an
estimator h for λ, given by h = mean(x) - 1, where mean(x) =
sum(x)/N, that was appropriate for λ small compared to 20cm; with
a little ingenuity and the introduction of *ad hoc* bins, promising
estimators for λ large by comparison with 20cm could be constructed. But
there was no obvious estimator that would work under all conditions.

Please stop and think about this problem for a moment.

Steve wrote:

- P(x given λ) = exp(-x/λ) / λ / Z(λ) for 1 < x < 20
- P(x given λ) = 0 otherwise

with

- Z(λ) = integral(1 to 20: x-> exp(-x/λ) / λ :) = exp(-1/λ) - exp(-20/λ)

This seemed obvious enough. Then he wrote:

- P(λ given [x(N),...,x(1)]) = P([x(N),...,x(1)] given λ) . P(λ) / P([x(N),...,x(1)])

which is proportional to

- power(-N, λ.Z(λ)) . exp(-sum(x)/λ) . P(λ).

Suddenly, the straightforward distribution P([x(N),...,x(1)] given λ), defining the probability of the data given the hypothesis λ, was being turned on its head so as to define the probability of a hypothesis given the data. A simple figure showed the probability of a single data point, P(x given λ), as a familiar function of x for different values of λ. Each curve was an innocent exponential, normalized to have area 1. But when the same probability was plotted as a function of λ for a fixed value of x, something remarkable happened: a peak emerged.

The probability density P(x given λ) as a function of x.

The probability density P(x given λ) as a function of λ.

When plotted this way round, the function is known as the likelihood.
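The two readings of P(x given λ) can be checked numerically. The following is a minimal Python sketch and not part of the original notes: the helper names `z`, `density` and `integrate`, the midpoint rule and the particular values of λ are all illustrative choices.

```python
import math

X_MIN, X_MAX = 1.0, 20.0  # observation window in cm, from the problem statement

def z(lam):
    """Z(λ) = exp(-1/λ) - exp(-20/λ), the normalizing constant."""
    return math.exp(-X_MIN / lam) - math.exp(-X_MAX / lam)

def density(x, lam):
    """P(x given λ): exp(-x/λ) / λ / Z(λ) inside the window, 0 outside."""
    if not (X_MIN < x < X_MAX):
        return 0.0
    return math.exp(-x / lam) / lam / z(lam)

def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Read as a function of x, the density has area 1 whatever λ is.
areas = {lam: integrate(lambda x: density(x, lam), X_MIN, X_MAX)
         for lam in (2.0, 5.0, 10.0)}

# Read as a function of λ at fixed x = 3, it rises and then falls again.
likelihood_at_3 = [(lam, density(3.0, lam)) for lam in (0.5, 1.0, 2.0, 5.0, 20.0)]

for lam, area in areas.items():
    print(f"lambda = {lam:5.1f}  area under P(x given lambda) = {area:.6f}")
for lam, value in likelihood_at_3:
    print(f"lambda = {lam:5.1f}  P(x=3 given lambda) = {value:.4f}")
```

The second loop is the interesting one: nothing forces the values to add up to anything in particular when λ is the variable, which is why the function gets a different name in that role.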

Steve summarised Bayes'
theorem as embodying the fact that "what you know about λ after the
data arrive is what you knew before [P(λ)], and what the data told you
[P([x(N),...,x(1)] given λ)]". Probabilities are used here to quantify
degrees of belief.
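As a rough illustration of "what you knew before" combined with "what the data told you", here is a sketch that evaluates power(-N, λ.Z(λ)) . exp(-sum(x)/λ) . P(λ) on a grid of λ values and normalizes it. The data set, the log-spaced grid and the flat prior over grid points are assumptions made up for this example; none of them come from the original notes.

```python
import math

def log_likelihood(lam, xs):
    """log of power(-N, λ.Z(λ)) . exp(-sum(x)/λ) for observations xs."""
    z = math.exp(-1.0 / lam) - math.exp(-20.0 / lam)
    return -len(xs) * math.log(lam * z) - sum(xs) / lam

def grid_posterior(xs, lam_grid, prior):
    """Posterior weights on lam_grid: likelihood times prior, normalized to sum to 1."""
    log_post = [log_likelihood(lam, xs) + math.log(prior(lam)) for lam in lam_grid]
    top = max(log_post)                     # subtract the maximum before exponentiating
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical observations (cm), invented for illustration only.
xs = [1.5, 2.0, 3.0, 4.0, 5.0, 12.0]

# Assumed grid: 601 values of λ, log-spaced from 0.1 to 100.
lam_grid = [10 ** (-1 + 3 * k / 600) for k in range(601)]

def flat(lam):
    """Assumed prior: equal weight on every grid point."""
    return 1.0

posterior = grid_posterior(xs, lam_grid, flat)
mode = lam_grid[posterior.index(max(posterior))]
print(f"most probable grid value of lambda: {mode:.2f}")
```

Passing a different function in place of `flat` is the simplest way to see how much the answer depends on the prior for a given data set.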

To nip possible confusion in the bud, it must be emphasized that the
hypothesis λ which correctly describes the situation is *not* a
stochastic variable, and the fact that the Bayesian uses a probability
distribution P does *not* mean that he thinks of the world as
stochastically changing its nature between the states described by the different
hypotheses. He uses the notation of probabilities to represent his beliefs about
the mutually exclusive micro-hypotheses, of which only one is actually true.
That probabilities can denote degrees of belief, given assumptions, seemed
intuitive to me, and has been proved.

The posterior probability distribution of Bayes' Theorem represents the unique and complete solution to the problem. There is no need to invent estimators; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty-seven ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem.

Our inference is conditional on our assumptions [for example, the prior
P(λ)]. Critics view such priors as a difficulty because they are
"subjective", but I don't see how it could be otherwise. How can one
perform inference without making assumptions? I believe that it is of great
value that Bayesian methods force one to make these tacit assumptions explicit.
First, once assumptions are made, the inferences are objective and unique,
reproducible with complete agreement by anyone who has the same information and
makes the same assumptions. For example, given assumptions H (e.g. those listed
above) and data D (e.g. those from an experiment measuring decay lengths),
everyone will agree about the posterior probability of the decay length
λ:

- P(λ given D,H) = P(D given λ,H) . P(λ given H) / P(D given H)

Second, when the assumptions are explicit, they are easier to criticize, and we can quantify the sensitivity of our inferences to the details of the assumptions. We can note from the likelihood curves that in the case of a single data point at x=5, the likelihood function is less strongly peaked than in the case x=3; the details of the prior P(λ) become more important as mean(x) gets closer to 10.5, the middle of the window. In the case x=12, the likelihood function doesn't have a peak at all. Such data merely rule out small values of λ, and don't give any information about the relative probabilities of large values of λ. So in this case, the details of the prior at the small λ end of things are not important, but at the large λ end, our prior is important.
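The comparison between x=3, x=5 and x=12 can be made concrete with a short sketch. It assumes an illustrative log-spaced grid of λ values and measures each curve against 1/19, the value the likelihood tends to for large λ (the density becomes uniform over the 19cm window); that limit is a derived fact rather than something stated in the original notes, and the names `likelihood` and `summarize` are hypothetical.

```python
import math

def likelihood(lam, x):
    """P(x given λ) for a single observation x in the window 1 < x < 20."""
    z = math.exp(-1.0 / lam) - math.exp(-20.0 / lam)
    return math.exp(-x / lam) / lam / z

# Illustrative grid: 401 values of λ, log-spaced from 0.1 to 1000.
lam_grid = [10 ** (-1 + 4 * k / 400) for k in range(401)]
LARGE_LAMBDA_LIMIT = 1.0 / 19.0   # value the likelihood tends to as λ grows

def summarize(x):
    """Return (λ at the grid maximum, maximum divided by the large-λ limit)."""
    best = max(lam_grid, key=lambda lam: likelihood(lam, x))
    return best, likelihood(best, x) / LARGE_LAMBDA_LIMIT

for x in (3.0, 5.0, 12.0):
    best, ratio = summarize(x)
    print(f"x = {x:4.1f}: maximum at lambda = {best:7.2f}, "
          f"height relative to large-lambda limit = {ratio:.2f}")
```

A ratio well above 1 means the data pick out a range of λ on their own. A ratio at or below 1, with the maximum sitting at the end of the grid, means the curve just climbs towards its limit, so whatever is concluded about large λ has to come from the prior.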

Third, when we are not sure which of various alternative assumptions is the most appropriate for a problem, we can treat this question as another inference task. Thus, given data D and some unquestioned assumptions, I, we can compare alternative assumptions H using Bayes' theorem:

- P(H given D,I) = P(D given H,I) . P(H given I) / P(D given I).

Fourth, we can take into account our uncertainty regarding
such assumptions when we make subsequent predictions. Rather than choosing one
particular assumption H*, and working out our predictions about some quantity X,
P(X given D,H*,I), we obtain predictions that take into account our uncertainty
about H by using the sum rule: P(X given D,I) = sum(: H-> P(X given H,D,I)
. P(H given D,I) :). (This is another contrast with orthodox statistics, in
which it is conventional to "test" a default model, and then, if the test
"accepts" the model, to use that model exclusively to make predictions.)
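The third and fourth points can be sketched together in a few lines. All the numbers below are hypothetical, invented only to show the arithmetic, and the sketch assumes the listed hypotheses are mutually exclusive and exhaustive, so that P(D given I) is the sum of P(D given H,I) . P(H given I) over H.

```python
def posterior_over_models(evidence, prior):
    """P(H given D,I) for each H, from P(D given H,I) and P(H given I)."""
    joint = {h: evidence[h] * prior[h] for h in evidence}
    p_data = sum(joint.values())            # P(D given I)
    return {h: joint[h] / p_data for h in joint}

def averaged_prediction(prediction, posterior):
    """Sum rule: P(X given D,I) = sum over H of P(X given H,D,I) . P(H given D,I)."""
    return sum(prediction[h] * posterior[h] for h in posterior)

# Hypothetical numbers, invented purely for illustration.
evidence = {"H1": 0.02, "H2": 0.06}        # P(D given H,I)
prior = {"H1": 0.5, "H2": 0.5}             # P(H given I)
prediction = {"H1": 0.9, "H2": 0.3}        # P(X given H,D,I)

posterior = posterior_over_models(evidence, prior)
p_x = averaged_prediction(prediction, posterior)
print(posterior)
print(p_x)
```

With these numbers the better-supported H2 gets three times the weight of H1, but H1's prediction still contributes to the final answer. Under the test-then-accept procedure it would contribute nothing.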

Steve thus persuaded me that

Probability theory reaches parts that *ad hoc* methods cannot reach.

Let's look at a few more examples of simple inference problems. The following example illustrates that there is more to Bayesianism than the priors.

Two people have left traces of their own blood at the scene of a crime.
Their blood groups can be reliably identified from these traces and are found to
be of type O (a common type in the local population, having frequency
p(O) = 60%) and of type AB (a rare type, with frequency p(AB) = 1%). A
suspect is tested and found to have type O blood. Do these data D = (type
O and AB blood were found at scene) make it more probable that
this suspect was one of the two people present at the crime? A careless lawyer
might claim that the fact that the suspect's blood type was found at the scene
is positive evidence for the theory that he was present.

Denote the proposition "the suspect and one unknown person were
present" by S. The alternative, unS, states "two unknown people from the
population were present". The prior in this problem is the prior probability
ratio between the propositions S and unS. This quantity is important to the
final verdict and would be based on all other available information in the
case. Our task here is just to evaluate the contribution made by the data D,
that is, the likelihood ratio, P(D given S,H)/P(D given unS,H). In general, a
jury's task should be to multiply together carefully evaluated likelihood ratios
from each independent piece of admissible evidence.

The probability of the data given S is the probability that one unknown person drawn from the population has blood type AB.

- P(D given S,H) = p(AB)

The probability of the data given unS is the probability that two unknown people drawn from the population have types O and AB. Either of the two could be the type O one, hence the factor of 2.

- P(D given unS,H) = 2 . p(O) . p(AB)

In these equations H denotes the assumptions that two people were present and left blood there, and that the probability distribution of the blood groups of unknown people in an explanation is the same as the population frequencies, p(O) and p(AB).

Dividing, we obtain the likelihood ratio:

- P(D given S,H)/P(D given unS,H) = 1 / 2 / p(O) = 0.5 / 0.6 = 0.83

Thus the data in fact provide weak evidence *against* the
supposition that this suspect was present.

This result may be found surprising, so let us examine it from various points of view. First consider the case of another suspect who has type AB. Intuitively, the data do provide evidence in favour of the theory S' that this suspect was present, relative to the null hypothesis unS. And indeed the likelihood ratio in this case is:

- P(D given S',H) / P(D given unS,H) = 1 / 2 / p(AB) = 50.

Now let us change the situation slightly; imagine that 99% of people
are of blood type O, and the rest are of type AB. The data at the scene are the
same as before. Consider again how these data influence our beliefs about a
particular suspect of type O and another of type AB. Intuitively, we still
believe that the presence of the rare AB blood provides positive evidence that
the suspect of type AB was there. But do we still have the feeling that the
fact that type O blood was detected at the scene favours the hypothesis that the
type O suspect was present? If this were the case, that would mean that
regardless of who the suspect is, the data make it more probable they were
present, which would be absurd. The data may be *compatible* with any
suspect of either blood type being present, but if they provide positive
evidence for some theories, they must also provide evidence against other
theories.
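The likelihood ratios for both suspects, in the original population and in the modified one with 99% type O, can be reproduced with exact fractions. This is a minimal sketch; the function name `likelihood_ratios` and its return order are illustrative choices, and the probabilities are those used in the text.

```python
from fractions import Fraction

def likelihood_ratios(p_o, p_ab):
    """Likelihood ratios for a type O suspect and a type AB suspect,
    given that one type O stain and one type AB stain were found."""
    p_data_uns = 2 * p_o * p_ab   # two unknowns, types O and AB in either order
    p_data_s_o = p_ab             # type O suspect present: the unknown must be AB
    p_data_s_ab = p_o             # type AB suspect present: the unknown must be O
    return p_data_s_o / p_data_uns, p_data_s_ab / p_data_uns

original = likelihood_ratios(Fraction(60, 100), Fraction(1, 100))
modified = likelihood_ratios(Fraction(99, 100), Fraction(1, 100))
print("p(O) = 60%: type O suspect", original[0], " type AB suspect", original[1])
print("p(O) = 99%: type O suspect", modified[0], " type AB suspect", modified[1])
```

In both populations the type O suspect's ratio is below 1 and the type AB suspect's is above 1, so the same data cannot count in favour of every suspect at once.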

Here is another way of thinking about this: imagine that instead of two people's blood stains there are ten, and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. Consider a particular type O suspect: without any other information, there is a one in ten chance that he was at the scene. We now get the results of blood tests, and find that nine of the ten stains are of type AB, and one of the stains is of type O. Does this make it more likely that the type O suspect was there? No: although he could have been, there is now only a one in ninety chance that he was, since we know that only one person present was of type O.
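The one in ten and one in ninety figures can be confirmed by applying Bayes' theorem directly. The sketch below assumes that, before the blood tests, every group of ten locals is equally likely to have been present (one reading of "without any other information"); the helper `p_types` is a hypothetical name.

```python
from fractions import Fraction
from math import comb

# Numbers from the example: 100 locals (90 type O, 10 type AB), 10 stains.
N_O, N_AB, N_PRESENT = 90, 10, 10
STAINS_O, STAINS_AB = 1, 9

# Prior: the suspect is one of 10 people drawn from 100.
prior = Fraction(N_PRESENT, N_O + N_AB)

def p_types(k_o, k_ab, pool_o, pool_ab):
    """Probability that k_o + k_ab people drawn without replacement from the
    pool contain exactly k_o of type O and k_ab of type AB."""
    return Fraction(comb(pool_o, k_o) * comb(pool_ab, k_ab),
                    comb(pool_o + pool_ab, k_o + k_ab))

# Suspect (type O) present: the other 9 come from the remaining 99 people.
p_data_present = p_types(STAINS_O - 1, STAINS_AB, N_O - 1, N_AB)
# Suspect absent: all 10 come from the remaining 99 people.
p_data_absent = p_types(STAINS_O, STAINS_AB, N_O - 1, N_AB)

posterior = (prior * p_data_present /
             (prior * p_data_present + (1 - prior) * p_data_absent))
print("prior:", prior, " posterior:", posterior)
```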

Maybe the intuition is aided finally by writing down the formulae for the
general case where n(O) blood stains of individuals of type O are found, and
n(AB) of type AB, a total of N individuals in all, and unknown people come from
a large population with fractions p(O), p(AB). The task is to evaluate the
likelihood ratio for the two hypotheses S, "the type O suspect and N-1 unknown
others left N stains", and unS, "N unknowns left N stains". The
probability of the data under hypothesis unS is just the probability of getting
n(O), n(AB) individuals of the two types when N individuals are drawn at random
from the population:

- P(n(O),n(AB) given unS) = power(n(O), p(O)) . power(n(AB), p(AB)) . N! / n(O)! / n(AB)!

In the case of hypothesis S, we need to predict the distribution of the N-1 other individuals:

- P(n(O),n(AB) given S) = power(n(O)-1, p(O)) . power(n(AB), p(AB)) . (N-1)! / (n(O)-1)! / n(AB)!

The likelihood ratio is:

- P(n(O),n(AB) given S) / P(n(O),n(AB) given unS) = n(O) / N / p(O).

This is a very instructive result. The likelihood ratio, *i.e.* the
contribution of these data to the question of whether the type O suspect was
present, depends simply on a comparison of the frequency of type O blood in the
observed data, n(O) / N, with the background frequency of type O blood in the
population, p(O). There is no dependence on the counts of the other types found
at the scene, or their frequencies in the population. If there are more type O
stains than the average expected by chance (hypothesis unS), then the data give
evidence in favour of the presence of this type O suspect. Conversely, if there
are fewer type O stains than the expected number under unS, then the data reduce
the probability of the hypothesis that he was there. In the special case n(O)/N
= p(O), the data contribute no evidence either way, regardless of the fact that
the data are compatible with the hypothesis S.
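The closed form n(O) / N / p(O) can be checked against the two counting formulae it was derived from. A minimal sketch, using exact fractions and the population frequencies from the earlier example; the function names and the particular stain counts tried are illustrative.

```python
from fractions import Fraction
from math import factorial

def p_counts(n_o, n_ab, p_o, p_ab):
    """Probability of n_o type O and n_ab type AB stains from n_o + n_ab unknowns."""
    n = n_o + n_ab
    return (p_o ** n_o * p_ab ** n_ab
            * Fraction(factorial(n), factorial(n_o) * factorial(n_ab)))

def likelihood_ratio(n_o, n_ab, p_o, p_ab):
    """P(data given S) / P(data given unS) for a type O suspect (needs n_o >= 1)."""
    return p_counts(n_o - 1, n_ab, p_o, p_ab) / p_counts(n_o, n_ab, p_o, p_ab)

p_o, p_ab = Fraction(60, 100), Fraction(1, 100)
results = {}
for n_o, n_ab in [(1, 1), (2, 3), (6, 4), (9, 1)]:
    n = n_o + n_ab
    ratio = likelihood_ratio(n_o, n_ab, p_o, p_ab)
    assert ratio == Fraction(n_o, n) / p_o      # closed form n(O) / N / p(O)
    results[(n_o, n_ab)] = ratio
    print(f"n(O) = {n_o}, n(AB) = {n_ab}: likelihood ratio = {ratio}")
```

The case n(O) = 6, n(AB) = 4 has n(O)/N equal to p(O), so its ratio is exactly 1: data that are compatible with S yet say nothing for or against it.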