When data arrive one at a time and we want statistics about them once they have all come in, it is of course possible to simply collect all the data and then, after the last has arrived, compute the desired statistics. However, this requires storage proportional to the number of data values received, so it is preferable – if all we want is the final statistics – to compute the results incrementally. This scales much better.
If the only statistic we need is the mean (the sum of the data, divided by how many there were), it suffices to keep track of the current mean, m, and how many data went into computing it, n. When a new datum v comes in, we can update m to (n.m +v)/(n +1) = m +(v −m)/(n +1) and add one to n. Doing the former update by adding (v −m)/(n +1) to m is the natural approach, since it limits arithmetic imprecision when n gets large. This incremental computation is cheap and strikes a tolerable balance between efficiency and accuracy.
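To make that concrete, here is a minimal sketch in Python (the function name and the (count, mean) pair are my own choices for illustration):

```python
def mean_update(count, mean, datum):
    # Add (datum - mean) / (count + 1) to the mean, rather than
    # recomputing (count * mean + datum) / (count + 1), to limit
    # rounding error when count gets large.
    count += 1
    mean += (datum - mean) / count
    return count, mean

count, mean = 0, 0.0
for v in [3.0, 5.0, 10.0]:
    count, mean = mean_update(count, mean, v)
# mean is now 6.0, the average of the three values
```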
In many cases we would like to know not just the mean but also the standard deviation – a measure of how much the data vary about the mean. The square of this is called the variance and is more directly computed, so it is what I'll compute here. The variance is defined to be the average of the squares of the data's differences from the mean. A tidy piece of mathematics, which I'll come to below, shows that this equals the average of the squares of the data minus the square of the mean. However, for technical statistical reasons, one actually divides the sum of squares of differences from the mean by one less than the number of data that went into that sum, because the mean being used is derived from the same data whose distances from it we're averaging – the variance is defined in terms of the differences of the data from the true population mean, for which our sample mean is only a current best estimate, and it turns out that this change of scaling corrects for that. The details I leave for elsewhere; for present purposes I'll just take this as given and work with this scaling for the sample variance.
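A quick numerical check of the two formulations, both scaled by n −1 to give the sample variance (the data values are arbitrary; Python's statistics module serves as the reference):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
m = sum(data) / n

# Sum of squared differences from the mean, scaled by n - 1:
direct = sum((x - m) ** 2 for x in data) / (n - 1)
# The rearranged form: sum of squares minus n times the squared mean:
rearranged = (sum(x * x for x in data) - n * m * m) / (n - 1)

assert abs(direct - rearranged) < 1e-9
assert abs(direct - statistics.variance(data)) < 1e-9
```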
We're naturally going to still want our incremental computation of the mean, for which we need to remember how many data we've seen and what the running average is, so let's now suppose we also have the running variance. We have a notional sequence ({real}: data :{naturals}), from which we've seen (: data |n) for some natural n. We haven't recorded the values of all those data, but we have kept track of n, mean(n) = sum(: data :n)/n and:

vary(n) = sum(: square(data(i) −mean(n)) ←i :n)/(n −1)
in which square = (: x.x ←x :) = power(2). When we receive data(n), we can, as above, obtain mean(n +1) by adding (data(n) −mean(n))/(n +1) to mean(n); how do we need to update vary(n) to get vary(n +1) ?
We know s = vary(n), m = mean(n) and n. We're given data(n), from which we compute e = data(n) −m to get mean(n +1) = mean(n) +e/(n +1). To see how to compute vary(n +1) from s, m, e and n, using data(n) = m +e, observe:

n.vary(n +1) = (n −1).s +n.e.e/(n +1), whence vary(n +1) = s +e.e/(n +1) −s/n
Since we divide s by n, we need to consider carefully what to do on the first iteration, where n = 0 and we can't divide by it. The correct solution must give mean(1) = data(0), vary(1) = 0, with mean(0) and vary(0) formally undefined (each divides an empty sum by zero, trying to estimate a property of the distribution from no data). This means that, if we're holding n, mean(n), vary(n) in a triple count, mean, vary of local variables, our update step looks like (using Python code to express it):
def update(count, mean, vary, datum):
    if not count:  # first datum
        return 1, datum, 0
    gap = datum - mean
    sn = vary / count  # s / n, using the prior count
    count += 1
    step = gap / count  # e / (n + 1)
    mean += step
    vary += gap * step - sn
    return count, mean, vary
(in which += increments the variable on its left by the quantity computed on its right, using (when relevant) the prior value of what's on the left; and * is the multiplication operator). Note that the update to vary only depends on mean via its use in computing gap. When available, a prior-knowledge value could be used in place of mean, saving the need to compute it (although I'd still track the population's average error from the prior-knowledge expectation, just to be sure, even when not using it in computing the variance).
It is thus possible to incrementally compute the mean and variance of data, as they arrive, without having to record all of the data. All one needs to record are the running values of the quantities one wants at the end in any case.
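As a sanity check (not part of the analysis above, and with arbitrary test values), we can run the update step over some data and compare with Python's statistics module:

```python
import statistics

def update(count, mean, vary, datum):
    if not count:  # first datum
        return 1, datum, 0.0
    gap = datum - mean
    sn = vary / count  # s / n, using the prior count
    count += 1
    step = gap / count  # e / (n + 1)
    return count, mean + step, vary + gap * step - sn

data = [1.0, 4.0, 2.0, 8.0, 5.0]
count, mean, vary = 0, 0.0, 0.0
for v in data:
    count, mean, vary = update(count, mean, vary, v)

assert abs(mean - statistics.mean(data)) < 1e-9
assert abs(vary - statistics.variance(data)) < 1e-9
```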
Alternatively, in place of vary, we could keep track of sqdiff(n) = (n −1).vary(n), the plain sum of squared differences from the current mean. If we do this, then the update is just

sqdiff(n +1) = sqdiff(n) +e.(data(n) −mean(n +1)) = sqdiff(n) +n.e.e/(n +1)

making the update step:
def update(count, mean, sqdiff, datum):
    gap = datum - mean
    count += 1
    step = gap / count
    mean += step
    sqdiff += gap * (datum - mean)
    return count, mean, sqdiff
exploiting the fact that data(n) −mean(n +1) = e −e/(n +1) = n.e/(n +1), which we can multiply by e to get the change in sqdiff. Multiplying the gap between prior mean and the new datum by the gap between updated mean and datum thus substitutes a subtraction for a multiplication (by prior count) and a division (by the updated count), either of which would be more expensive than the subtraction. For a practical use of this algorithm, see this review of code one of my colleagues has written (which is what prompted me, as reviewer, to do this analysis).
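The sqdiff form passes the same sanity check; dividing the final sqdiff by one less than the count recovers the sample variance:

```python
import statistics

def update(count, mean, sqdiff, datum):
    gap = datum - mean              # e = datum - mean(n)
    count += 1
    mean += gap / count             # mean(n + 1) = mean(n) + e / (n + 1)
    sqdiff += gap * (datum - mean)  # e times (datum - mean(n + 1))
    return count, mean, sqdiff

# No zero-count special case is needed: on the first datum the update
# sets mean = datum, so sqdiff gains gap * 0 = 0.
data = [1.0, 4.0, 2.0, 8.0, 5.0]
count, mean, sqdiff = 0, 0.0, 0.0
for v in data:
    count, mean, sqdiff = update(count, mean, sqdiff, v)

assert abs(sqdiff / (count - 1) - statistics.variance(data)) < 1e-9
```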
Consider, now, the case where our data are samples of the value of something that is, or may be, varying. For example, the data may be measurements, known to be subject to random errors (which is why we need to collect several and average to get a good estimate of the true value), of a varying physical quantity. As long as we get new data values fast compared to the rate of variation, we can still average our data to get a useful running average, but now we want to discard older values in favour of newer ones. We could do that by holding a finite buffer of the past data and recomputing statistics from those in the buffer as we go, dropping the oldest as we add each new entry. Even then we want to pay more attention to the more recent entries, so we'll want a weighted sum, where the weights of newer entries are larger than those of older ones. An obvious and simple way to do that is for each entry's weight to be the weight of the next most recent divided by some constant factor. As long as the scaling is big enough that the weight of the last entry in our buffer is negligible, continuing past the end of our finite buffer would give the entries beyond it negligible weight, too, so it makes little difference whether we keep a finite buffer or the whole sequence of all we've seen. Thus, if we can find a way to update just the current values of our final results, and a few helper variables, we can dispense with the finite buffer entirely.
So let's suppose, as before, that we have an incoming stream of data, modelled as a function (: data :{naturals}) of which we, at any given moment, only know (: data |n) for some natural n, and we keep track of running values of the mean and variance as, for some constant q > 1:

mean(n) = sum(: data(n −1 −i)/q^i ←i :n)/W(n)
vary(n) = sum(: square(data(n −1 −i) −mean(n))/q^i ←i :n)/W(n)
wherein q is the factor by which we scale down our past data at each turn and W(n) is the sum of the various weights by which we're scaling our data,

W(n) = sum(: 1/q^i ←i :n) = (1 −1/q^n).q/(q −1)
One might well, at least for the first few iterations, keep track of that just by adding successive powers of 1/q to a running total, but once n gets reasonably big (maybe 4 and beyond) this formula shall let you get a more accurate account of the actual sum and, in particular, for large n it is just q/(q −1).
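For instance, with q = 2 (an arbitrary choice for illustration), the running total and the closed form W(n) = (1 −1/q^n).q/(q −1) agree, and both approach q/(q −1) = 2 for large n:

```python
q = 2.0

W = 0.0  # W(0): the empty sum
for n in range(1, 60):
    W = 1 + W / q  # the running update: W(n) = 1 + W(n - 1) / q
    closed = (1 - q ** -n) * q / (q - 1)
    assert abs(W - closed) < 1e-12

# By n = 59 the sum is within rounding error of its limit q / (q - 1):
assert abs(W - q / (q - 1)) < 1e-12
```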
What, then, are the update rules for count, mean, vary ? For W, the update rule is W(n +1) = 1 +W(n)/q: divide the current value by q, add 1. It is thus worth noting in advance that:

q.W(n +1) = q +W(n), whence W(n +1)/W(n) = (1 +q/W(n))/q
which I shall exploit below. As above, let's suppose we know n, m = mean(n), s = vary(n) and have now received data(n). We can either record W(n) and update as noted above, or just compute it by brute force given n and q. We get:

W(n +1).mean(n +1) = data(n) +W(n).mean(n)/q = data(n) +(W(n +1) −1).m
whence

mean(n +1) = m +(data(n) −m)/W(n +1)

so, as before, we can use e = data(n) −m as our error term, now scaled down by W(n +1) instead of n +1, to get the adjustment to m. As before, this gives us data(n) = e +m. So now on to the variance. First observe that our updated mean is m +e/W(n +1), so data(n)'s difference from it is

data(n) −mean(n +1) = e −e/W(n +1) = e.W(n)/(q.W(n +1))

whence

vary(n +1) = (s +e.e/W(n +1))/(1 +q/W(n)) = s +(e.e/W(n +1) −q.s/W(n))/(1 +q/W(n))

in which we can compute 1 +q/W(n) and e before updating other variables and the rest after updating W, n and mean.
We thus have e / W(n +1) as our increment to mean and the slightly more complicated second term just given as our increment to vary. Again, incremental update is possible and we only need to store n, m, s and, maybe for the first few iterations, W(n).
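Pulling those rules together as code (my own sketch, in the same Python style as above; the cross-check computes the weighted mean and variance directly from their definitions, with the newest datum's weight 1 and each older weight divided by q):

```python
def weighted_update(count, mean, vary, W, datum, q):
    # One step of the exponentially weighted running mean and variance;
    # q > 1 is the factor by which each older datum's weight is scaled down.
    if not count:  # first datum: W(1) = 1, mean = datum, vary = 0
        return 1, datum, 0.0, 1.0
    scale = 1 + q / W   # uses the old W(n)
    W = 1 + W / q       # W(n + 1) = 1 + W(n) / q
    gap = datum - mean  # e = data(n) - mean(n)
    mean += gap / W     # mean(n + 1) = mean(n) + e / W(n + 1)
    vary = (vary + gap * gap / W) / scale
    return count + 1, mean, vary, W

q = 2.0
data = [1.0, 4.0, 2.0, 8.0, 5.0]
state = (0, 0.0, 0.0, 0.0)
for v in data:
    state = weighted_update(*state, v, q)
count, mean, vary, W = state

# Direct computation from the definitions, for comparison:
weights = [q ** (i + 1 - len(data)) for i in range(len(data))]
total = sum(weights)
direct_mean = sum(w * d for w, d in zip(weights, data)) / total
direct_vary = sum(w * (d - direct_mean) ** 2
                  for w, d in zip(weights, data)) / total
assert abs(mean - direct_mean) < 1e-9
assert abs(vary - direct_vary) < 1e-9
```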

Written by Eddy.