Even random processes follow some rules, despite not being entirely
predictable. Given a straightforward understanding of the nature of the
process, it's often possible to deduce quite a lot about its behaviour; much
of what we can deduce can be expressed in terms of a probability
distribution. In fact, two of these are sufficiently common and interesting
to warrant their own pages: the gaussian (a.k.a. normal) and gamma
distributions.

The simplest case to describe is a distribution on pure numbers. Subject to, at most, a choice of scaling, this suffices to describe many one-parameter variates, such as the height distribution of a population, the length of time one waits at a bus stop or the number of people with whom you're going to have to share that big lottery win when you finally get it. Where the pure number in question is actually a whole number (in which case there's a natural choice for scaling, if any is needed at all) it's usually (but not always) a natural number (i.e. non-negative); the analysis in this case is discrete, with each possible outcome having a probability and relevant quantities being computed by summing the products of these probabilities with various functions of the variate. Otherwise, one has a real-valued variate, each individual outcome has formally zero probability and we can only discuss probabilities for the variate falling in an interval (or union of intervals); these can be expressed in terms of a measure, which can usually be represented by a density function that we integrate over an interval to obtain the probability of the variate falling in that interval. Relevant quantities are then computed in the same way as for the discrete case, but substituting the density for the probabilities and integration for summation.
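To make the discrete/continuum contrast concrete, here is a minimal sketch (in Python; the fair die and the uniform density are illustrative examples, not taken from anything above) of computing an expected value both ways: a probability-weighted sum in the discrete case, a numerical approximation to the integral of value times density in the continuum case.

```python
# Discrete case: each outcome of a fair die has probability 1/6; the
# expected value is the sum of outcome times probability.
die_mean = sum(k * (1 / 6) for k in range(1, 7))  # = 21/6 = 3.5

# Continuum case: a variate uniform on [0, 1) has density 1 there; the
# expected value is the integral of x times the density, approximated
# here by a Riemann sum.
steps = 100000
uniform_mean = sum((i / steps) * (1 / steps) for i in range(steps))
```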

Suppose I select one face of a coin and toss it repeatedly until it comes down with that face upwards, counting how many tosses it takes. I could likewise pick one face of a die and roll it repeatedly until it lands with that face upwards, counting the number of rolls. In each case, I perform a sequence of trials (coin tosses, die rolls) and the outcome of each trial is independent of the outcome of prior trials (although whether I bother to make the next trial may depend on the earlier trials). It's taken as given that each trial's probability of producing the waited-for outcome is the same as any other trial's. The geometric distribution describes how many trials I must do before the waited-for result arises.

In the general case, I have some random trial which, with probability p, yields a chosen outcome; and I count how many times I must repeat the trial before that outcome arises. The probability that the chosen outcome happens the very first time is simply p. The probability that it doesn't happen in the first n trials is power(n, 1−p), so the probability that it happens on trial 1+n is simply p.power(n, 1−p), with the very first trial being the special case with n = 0. Since sum(: power(n, q) ←n |{naturals}) is 1/(1−q) for 1>q>−1, taking q = 1−p yields p.sum(: power(n, 1−p) ←n |{naturals}) = 1, as required.
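As a sanity check, a short simulation (a Python sketch; the choice p = 1/6 and the sample counts are illustrative) can compare the claimed probabilities p.power(n, 1−p) against repeated random trials:

```python
import random

def trials_until_success(p, rng):
    """Repeat a trial with success probability p; count trials until success."""
    count = 1
    while rng.random() >= p:
        count += 1
    return count

rng = random.Random(42)
p = 1 / 6  # e.g. waiting for one chosen face of a die
# The probabilities p * (1 - p)**n of success on trial 1+n should sum to 1.
total = sum(p * (1 - p) ** n for n in range(200))
# Empirically, success on the very first trial happens with frequency near p.
counts = [trials_until_success(p, rng) for _ in range(100000)]
first_frequency = counts.count(1) / len(counts)
```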

The expected number of trials I must make is then p.sum(: (1+n).power(n, 1−p) ←n |{naturals}), so consider f = (: sum(: power(n, t) ←n |{naturals}) ←t :) and observe that

- f'(t)
- = sum(: n.power(n−1, t) ←n |{naturals})
- = 0 + sum(: (1+n).power(n, t) ←1+n |{naturals})

so p.f'(1−p) is exactly the expected value we were looking for. However, since (for 1>t>−1) f(t) = 1/(1−t) = power(−1, 1−t), we can compute f'(t) = power(−2, 1−t) and infer that our expected value is p/p/p = 1/p. This makes sense: the probability per trial of success is p, so we expect one success per roughly 1/p trials.
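A quick numerical check of this conclusion (a Python sketch, with an arbitrary choice of p) truncates the series p.sum(: (1+n).power(n, 1−p) ←n |{naturals}) and compares it with 1/p:

```python
p = 0.25
# Truncate the series p * sum((1 + n) * (1 - p)**n); the tail is negligible.
expected_trials = sum((1 + n) * p * (1 - p) ** n for n in range(1000))
# This should agree with 1/p, i.e. 4 trials on average when p = 1/4.
```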

Let N be the number of trials we actually end up making; it's a random
variate and the probability that N is 1+n is p.power(n, 1−p). For any
natural m, the expected value of (N+m)!/(N−1)! is p.sum(:
(n+1+m)!.power(n, 1−p)/n! ←n |{naturals}), which is p times the (1+m)-th
derivative of f, evaluated at 1−p, namely (1+m)!/power(1+m, p). The case
m = 0 is simply the expected value of N, seen above. The case m = 1 gives us
the expected value of N.(1+N) as 2/p/p, so the expected value of N.N is 2/p/p
−1/p and the variance of N is (1−p)/p/p. Higher values of m imply
expected values for the higher powers of N; the expected powers of N are also
known as moments of N. Because the successive derivatives of f yield the
moments, f is known as the moment-generating function for N.
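These claimed moments can be checked numerically; the sketch below (Python, with an arbitrary p) compares truncated sums against the closed forms E[N] = 1/p, E[N.(1+N)] = 2/p/p and variance (1−p)/p/p:

```python
p = 0.25
probs = [p * (1 - p) ** n for n in range(2000)]  # P(N = 1+n)
mean = sum((1 + n) * pr for n, pr in enumerate(probs))              # E[N]
second = sum((1 + n) * (2 + n) * pr for n, pr in enumerate(probs))  # E[N.(1+N)]
variance = second - mean - mean ** 2  # E[N.N] - E[N]**2
# With p = 1/4: mean = 4 = 1/p, second = 32 = 2/p/p, variance = 12 = (1-p)/p/p.
```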

If some event has a certain probability per unit time of happening, we can
ask for the distribution of intervals between times when it *does*
happen. Where the geometric distribution was discrete (its random variate could
only take whole values), this shall yield a continuum distribution (its random
variate can take any positive real value, not just a whole value).
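One way to see the continuum limit emerging (a sketch; the rate and step sizes are illustrative) is to discretise time into small steps of length dt, so the event has probability p = rate.dt per step; the geometric tail power(t/dt, 1−p) then approaches exp(−rate.t) as dt shrinks:

```python
import math

rate = 2.0  # probability per unit time of the event
t = 0.5     # interval of interest
errors = []
for dt in (1e-2, 1e-3, 1e-4):
    p = rate * dt                        # per-step probability
    discrete_tail = (1 - p) ** (t / dt)  # geometric: no event in t/dt steps
    errors.append(abs(discrete_tail - math.exp(-rate * t)))
# The discrepancy from the exponential tail shrinks with the step size.
```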

Simple one-dimensional distributions are fine when we only have one thing to measure, but what about when we have several ? A natural answer is to model these using a vector space, with one co-ordinate for each variate; we then have a single vector-valued variate to encode our data. As ever, the mode of a distribution is simply wherever it takes its highest value; this definition works just fine in a vector space.

When a variate's values lie in a vector space, we can carry over the
definition of the mean quite straightforwardly from the one-dimensional case;
scale each possible outcome (now a vector value the variate may take, rather
than just a number) by its probability (or by the variate's distribution's
density for that outcome) and sum (or integrate) over possible outcomes. The
result shall be a vector, since it's obtained by scaling a bunch of vectors and
summing. We can subtract this mean from any value the variate is capable of
taking; both the specification of variance and
the gaussian distribution call for us to do this and
to square the result. This obliges us, in the multi-dimensional case, to
ask what the square of a vector might be.

One common square of a vector is its inner product with itself, its
squared length; but this depends on a choice of metric (or, equivalently, of
basis). In general, a square of a vector is the result of supplying the
vector as both inputs to some bilinear map on our vector space, V; and any
bilinear map on V may be factorised via the tensor product of V with itself,
V⊗V. Thus the most general square of a vector v in V that we can
come up with is in fact the tensor v×v in V⊗V. Using this we can
carry over the usual definition of the variance of our random variate, averaging
(v−m)×(v−m) over possible (vector) values v of our variate,
with m being the (vector) mean of our variate. The result is a tensor in
V⊗V; as a scaled sum of squares, it is necessarily symmetric so there is
some basis of V for which it is diagonal.
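In coordinates, this tensor is just the familiar covariance matrix; the following sketch (Python with numpy; the sample-generating details are arbitrary) forms it as the average of the tensor squares (v−m)×(v−m) and exhibits a basis in which it is diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples of a 2-component vector-valued variate with correlated coordinates.
samples = rng.normal(size=(10000, 2)) @ np.array([[2.0, 0.0], [1.0, 1.0]])
m = samples.mean(axis=0)  # the (vector) mean
d = samples - m           # v - m for each sample
# Average of the tensor squares (v - m) x (v - m): an element of V (x) V.
variance = (d[:, :, None] * d[:, None, :]).mean(axis=0)
# It is symmetric, so some orthonormal basis of V diagonalises it.
eigvals, basis = np.linalg.eigh(variance)
diagonal = basis.T @ variance @ basis
```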

The variance of a vector-valued variate is thus a quadratic form on the dual of the vector space of values of the variate. Another such tensor quantity we can define, which encodes correlations among the components of the variate, is a double average, over two variables ranging over the vector space, of the product of their differences from the mean, scaled by the variate's density at each.

So much for mode, mean and variance: what about median ? For a simple
variate, the median is the mid-point for which the distribution's totals
on either side of it are equal. One can
stretch the definition of median to apply to vector-valued variates, but
there is then no guarantee that there *exists* a median: indeed, having a
median is a strong symmetry constraint on a distribution. None the less, some
distributions *do* have medians; most obviously, the multi-dimensional
gaussian, whose median is its mean.