What follows is an outline of what I did to Rod Cook's MUM segmentation
program while I was working for NA
software: it's here mostly to illustrate how I go about evolving a software
project.  Rod's MUM algorithm is a way of decomposing into regions an image
which is corrupted with coherent speckle: each region
having a suitably uniform visual texture
 (e.g. a wood, a field, a
hedge-row, a lone tree), give-or-take the limitations of how much information
speckle has destroyed or the statistics used have lost.  I'll not be discussing
the image processing aspects of this beyond saying that MUM stands for Merge
Using Moments
 – it starts with lots of tiny regions and repeatedly
chooses a region to merge into a neighbouring region with suitably-similar
statistics. 
The existing definitive expression of the algorithm was a piece of ForTran written by an expert in image processing – my boss, Rod Cook. I got to convert it to C without worrying too much about what the image processing meant, just so long as I mimicked it right. Although the context called for an ultimate goal of doing segmentation much faster than anyone could at the time – in real time, as collected by SAR and deconvoluted by my colleagues' code – I deliberately left speed on one side and set about writing an implementation with straightforward modules interacting with one another via clean APIs.
 I knew about (what computer scientists describe as) topology (but
mathematicians describe as homotopy) and data structures to model it, so duly
wrote a library to describe abstract 2-dimensional space in such terms: this
managed all the business of telling me which regions were neighbours and of
merging each iteration's pair of regions into one another.  Statistical data
about the image then constituted the geometric
 data about each region.
That conveniently bundled up all the mundane book-keeping tasks into a library
with a straightforward API: what remained was to express Rod's algorithm in C
using that tool-kit.  This allowed for a good degree of abstraction in the
expression of the algorithm (and blessed was gcc, for it did
provide extern inline) – which made debugging it a lot easier
than debugging the ForTran. 
 [Separating out the statistical details also proved a boon: it
enabled us to explore different varieties of statistical data and how they might
be used to measure similarity
 between adjacent regions.  Thus, where
Rod's original code had computed the moments
 (mean, variance and their
kin) of the data, we were able to explore various notions of visual
texture
 (which take account of how image intensity varies with position,
within each region).  These enabled us to distinguish (for instance) between
woods and fields, even when their average brightness
 was the same.
Exploring variations on this theme was made practical by judicious separation of
the statistical analysis from the topological analysis.] 
In the first instance, I implemented the algorithm directly as Rod had coded it in the ForTran: while doing so, I noted a glaring issue in the form of a sorted list of all live regions. All we did with the list was, at each iteration, take the top two items off, merge them (possibly reposition some of their neighbours) and insert the result wherever it belonged in the list. Since we only needed to know which was best, at each iteration, it seemed wasteful to be keeping the whole list strictly sorted (we only need the upper reaches, as it were) especially as the amount of effort in repositioning an item was always of order the length of the list. Still, that's how I did it first time round, because it was how the ForTran did it and it'd be easier to debug a direct mimic.
I'd left speed on one side, so it wasn't fast. But it was robust and made from well-structured components, so some fairly straightforward tweaking got it going about as fast as the ForTran (depending on image size, but only marginally). So then we set to looking at speed issues properly.
Each merge involves some statistical computation for each neighbour of the merged region and we intuited that the typical number of neighbours, over the entire history of merging, would grow about as fast as log(area). Each merge reduces the number of regions by one, so the number of merges is proportional to the area of the image. That told us to expect that that the time to segment an image would grow, with the area of the image, as roughly area.log(area). Note that neither of us knew much about sorting algorithms and speed at the time.
 My implementation, like the ForTran, was taking time proportional to the
square of the area.  After a little inspection of the system I realised that the
list sorting we were doing would take a square of area
 time, so the
book-keeping was costing more than the image-processing.  Remembering my earlier
concerns about the sorted list, I went and read Knuth's chapter on Sorting
and Searching
.  That gave compendious knowledge of how to do the job in
hand, and several others besides. 
 This is when I got pay-back from those clean APIs between modules: I only
had to change the limited amount of code concerned with the sorting, which was
well-isolated from everything else.  Using a variety of sorted binary tree, I
could reduce the list management to times proportional to log(area) per
insertion (used during initialisation of the data-structures) or removal (of
which each merge involves one).  Each merge also involves rattling
 the
new merged node from its position at the top of the tree to wherever it now
belongs, shuffling others upwards in the process: that again grows as log(area),
the height of the tree.  The time per merge thus contains an image-processing
term proportional to how many neighbours the merged node has and three
book-keeping terms proportional to log(area).  Experiment found that the result
overall was roughly proportional to log(area) per merge: since the statistical
computation per neighbour per merge is of similar size to the book-keeping of 1
(or perhaps 3) rattles, the typical number of neighbours must be rising no
faster than proportional to log(area) – our intuited rate. 
 [Time will also be spent rattling
 some neighbours of
the merged node (the others
 alluded to above): each of these neighbours
either saw one of the regions merged as my best neighbour
 or subsequently
see the merged region as such.  Since the two regions merged were as similar as
we could find, the new region should be likewise similar to them, so typically
about as similar to any of their neighbours as was the relevant old region.
Consequently, these adjusted nodes will typically only change how good is my
best merge
 by small amounts – so I would not be too surprised if their
rattling typically does nothing and almost never does more than one step.  In
any case, when multiplied by the number of such re-sorted nodes, this
rattling-cost would appear to come to at most log(area) – so if the
typical rattling-cost does grow significantly, we would need to suppose that the
number of re-sorted neighbours grows correspondingly less rapidly than
log(area).] 
 That gave us a fast enough segmentation program that we could realistically
look at doing the job in real time.  The first step towards that was to port the
program to the parallel computers used for the real-time deconvolution of the
SAR data.  This proved trivial – the only code change needed was to elide
the call to a Sun-specific timing routine which I'd been using (in
main) to measure run-times: writing portable code comes naturally
when building a program from robust well-structured components.  None the less,
the ported code was unacceptably slow, so we profiled it – and discovered
that the culprit was the system free() on the target – the
second call (and all subsequent calls) to free() would perform
heavy-duty processing designed to tidy up the free-space map. 
 Again, the clean interfaces saved the day.  Each datatype used in the
program had its own creation and destruction routines: so there were relatively
few places where malloc() and free() were called,
making it easy to replace these with calls to a custom memory manager my
colleague Ian knocked together for the job.  For this we could compute, before
allocating any of our data, how many data-structures of what sizes we would be
needing (the algorithm is well-behaved in this respect): for each size we could
then allocate a single sufficient raft of memory and initialise this as a linked
list of free nodes of that size; the allocation and release of items from this
list is then easy to code and has no memory overhead.  Each datatype's creator
and destructor could then call the correct size's allocator or releaser –
and the use of clean interfaces had guaranteed that we only had small amounts of
code to change.  The resulting program was fast enough that we could perform a
demonstration of real-time segmentation: we had to use a somewhat coarse grid on
the image in the first instance, but could confidently point to Moore's law to
give our clients the speed-up needed to use a finer grid within a few years
– where, previously, the task of real-time segmentation had seemed
unapproachable. 
It should be noted that MUM wasn't the best segmentation algorithm available – some neural net and simulated annealing methods were clearly better, and our clients had an algorithm of comparable quality to MUM which ran about as fast as Rod's implementation of MUM. Moore's law and improvements in neural and annealing technique duly ensured that, within five years or so, MUM was swept aside by better algorithms becoming feasible in real-time. Sic transit gloria mundi.

 Written by Eddy.
Written by Eddy.