Compressing to printable pure-ASCII

Compression programs take files and make them smaller, generally producing a binary file whose bytes use the full range of values, 0 through 255. Such files are not always suitable for transmission over a network and one can't usefully print them out; there are other situations, too, where using the full range of byte values is undesirable. Generally, one handles this (e.g. in e-mail attachments) by applying some encoding which reduces the file to purely printable ASCII; most commonly, base 64. This is computationally easy (every three bytes of input yield four characters of output) but somewhat inefficient: each output character carries only six bits, although there are 100 printable ASCII characters (in the broad sense that counts spacing as printable), six of which are spacing characters: space, horizontal tab, newline, carriage return, vertical tab and form-feed.
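To put rough numbers on that inefficiency, here is a small Python sketch. It leans on the fact that Python's string.printable and string.whitespace happen to use the same counting as above (100 and six); the helper name bits_per_char is made up for this illustration.

```python
import math
import string

# Python's string module counts the same way as the text:
# 100 "printable" characters, six of them spacing.
total = len(string.printable)        # 100
spacing = len(string.whitespace)     # 6
usable = total - spacing             # 94


def bits_per_char(alphabet_size: int) -> float:
    """Information carried by one character drawn from an alphabet of this size."""
    return math.log2(alphabet_size)


# Fraction of each 8-bit output character that carries payload.
base64_efficiency = bits_per_char(64) / 8         # 0.75
ideal_94_efficiency = bits_per_char(usable) / 8   # about 0.819
```

So an encoder that used all 94 non-spacing characters perfectly would emit roughly 8% fewer characters than base 64 for the same payload.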

In practice, it makes sense to leave out all six spacing characters. Various applications treat them specially and may mess with them; and some applications don't like to see lines longer than some limit, so it's desirable that we be able to insert newlines at arbitrary points in our compressed file and ignore them when uncompressing. It's also desirable that we be able to print out a compressed file and recover it thereafter; the form-feed, carriage return and vertical tab have special meaning to printers, which would mess this up; the horizontal tab is apt to do the same and, when it doesn't, to be mistaken for plain spaces. We could keep the space character, but some applications strip trailing spaces from lines; and leaving it out ensures that simply inserting newlines at regular intervals in the file produces something that, in a fixed-width font, prints out with a neat right margin, making it easy to recognise variation in line-length as corruption. We are thus left with 94 characters.
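A minimal sketch of what this buys a reader of such a file, in Python. The names ALPHABET, SPACING, strip_spacing and looks_ragged are invented for illustration, and the raggedness check assumes the writer broke lines at a fixed width known to the reader.

```python
# The 94 characters from '!' (0x21) to '~' (0x7E): printable ASCII without the space.
ALPHABET = "".join(chr(c) for c in range(0x21, 0x7F))

# The six spacing characters, none of which carries data.
SPACING = " \t\n\r\v\f"


def strip_spacing(text: str) -> str:
    """Drop every spacing character, wherever it was inserted."""
    return text.translate({ord(ch): None for ch in SPACING})


def looks_ragged(text: str, cols: int) -> bool:
    """True if any line but the last differs from cols in width,
    or the last line is wider than cols (a hint of corruption)."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty piece after a final newline
    if not lines:
        return False
    return any(len(line) != cols for line in lines[:-1]) or len(lines[-1]) > cols
```

For example, strip_spacing("ab \n\tc\f") gives "abc", and looks_ragged("abcd\nabc\nab\n", 4) is True because the second line has lost a character.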

One could, of course, construct a base 94 encoding; however (since 94 isn't a power of two) this would be somewhat clumsy to implement. It would be far better to have compression programs actually know to target the reduced character-set and use it as efficiently as possible. At output, we can add newlines at regular intervals; and, optionally, form-feed characters (which tell printers to start a new page) at regular intervals after newlines so as to make printing work nicely. The implied file format would be related to the usual form of the compression program, but different enough to warrant a distinct file extension and MIME type. It would thus be interesting to see how readily various existing compression programs could be adapted to support a common command-line option:

--ascii[=cols[,lines]]: Use alternate ASCII-armoured output format. When compressing, if =cols is specified, insert a newline after each cols characters of output unless cols is zero (in which case no newlines are inserted); the default is to act as if cols were given as the value of the COLUMNS environment variable, if set, else 72. If ,lines is also given, add a form-feed after every lines lines of output; by default, no form-feeds are added. When uncompressing, [=cols[,lines]] is ignored if given; all spacing characters in the input are ignored.
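As a rough illustration only, the Python sketch below strings the pieces together. Since no compressor targets the 94-character set natively, a plain base 94 block encoding stands in for that output stage (the clumsy option dismissed above, used here because it is short): nine bytes become eleven characters, which works because 94**11 >= 256**9. The rest follows the option's description. All names (encode94, decode94, parse_ascii_option, armour, dearmour) are hypothetical, as are two choices the description leaves open: a final short line still gets its newline, and an empty COLUMNS is treated as unset.

```python
import os
from typing import Mapping, Optional, Tuple

ALPHABET = "".join(chr(c) for c in range(0x21, 0x7F))  # '!'..'~', 94 symbols
BASE = len(ALPHABET)
BLOCK = 9  # bytes per full block: 94**11 >= 256**9, so 11 characters suffice


def chars_for(nbytes: int) -> int:
    """Smallest k with BASE**k >= 256**nbytes."""
    k = 0
    while BASE ** k < 256 ** nbytes:
        k += 1
    return k


# Characters emitted per block size (the last block may be short), and the reverse.
CHARS = {n: chars_for(n) for n in range(1, BLOCK + 1)}
BYTES = {k: n for n, k in CHARS.items()}


def encode94(data: bytes) -> str:
    """Stand-in for a compressor's output stage: each 9 bytes -> 11 characters."""
    out = []
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        value = int.from_bytes(chunk, "big")
        digits = []
        for _ in range(CHARS[len(chunk)]):
            value, d = divmod(value, BASE)  # the division a power of two would avoid
            digits.append(ALPHABET[d])
        out.append("".join(reversed(digits)))
    return "".join(out)


def decode94(text: str) -> bytes:
    """Inverse of encode94; expects spacing characters to be stripped already."""
    out = bytearray()
    for i in range(0, len(text), CHARS[BLOCK]):
        group = text[i:i + CHARS[BLOCK]]
        value = 0
        for ch in group:
            value = value * BASE + ALPHABET.index(ch)
        out += value.to_bytes(BYTES[len(group)], "big")
    return bytes(out)


def parse_ascii_option(value: Optional[str],
                       environ: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Map the text after '--ascii=' (None for a bare '--ascii') to (cols, lines).
    cols == 0 means no newlines; lines == 0 means no form-feeds."""
    cols_text, _, lines_text = (value or "").partition(",")
    cols = int(cols_text) if cols_text else int(environ.get("COLUMNS") or 72)
    lines = int(lines_text) if lines_text else 0
    return cols, lines


def armour(payload: str, cols: int = 72, lines: int = 0) -> str:
    """Newline after each cols characters; form-feed after every lines lines."""
    if cols <= 0:
        return payload
    out = []
    for count, start in enumerate(range(0, len(payload), cols), 1):
        out.append(payload[start:start + cols] + "\n")
        if lines > 0 and count % lines == 0:
            out.append("\f")
    return "".join(out)


def dearmour(text: str) -> str:
    """When uncompressing, ignore all spacing characters in the input."""
    return "".join(ch for ch in text if ch not in " \t\n\r\v\f")
```

With these definitions, decode94(dearmour(armour(encode94(data), *parse_ascii_option("40,60")))) returns data unchanged. The sketch does no error handling: malformed input to decode94 simply raises an exception.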

Written by Eddy.