Architecture for an editor

I have been an emacs user for nearly (in 2015) a quarter century now: my spinal column knows how to drive it without help from my brain. There's support in emacs for all manner of forms of data and ways of displaying it, for fetching data from (and storing it to) a marvelous assortment of resources via various protocols – for example, the ange-ftp mode of emacs enabled me to treat the entire internet (such as it then was) as my file-system way back in the early '90s. I can say much more in favour of emacs – yet it's not how I'd do the job if I started from scratch today (this was true in 2003, when I first wrote this, and remains so in 2015).

I could also say ample against emacs, most of it being rants about the defects of elisp as a configuration and extension language. For example, if you're an emacs user, load up SGML-mode and open a file with its HTML-mode: read the documentation for sgml-tag-alist and skeletons (if you can find the latter) and see if you can make any sense at all of what they're saying; try to match up that documentation with the value of html-tag-alist; choose an enhancement to the way the mode works (e.g. add lang="en" to the HTML element when inserted, or put a newline after each LI inserted by an OL or UL – I actually managed the latter, but the former defeated me) and see if you can work out how to implement it. But this is neither the time nor the place for that.

This has led me to think about what an editor should be, how it should be cut up into parts, what those parts are and how they should work together. The parts should be so specified as to allow modular replacement – i.e. different suppliers of software competing in each of the niches implied, via an open standard that serves the consumer. The result, inevitably, intrudes a long way into what the operating system should be, to support the editor; indeed, I leave no crisp boundary between which parts are the editor and which parts are the operating system; some components integral to the editor will obviously be integral to other tools, so we might call them parts of the operating system – but then the editor is a pretty crucial part of the operating system, and some components of the editor may later emerge as crucial components of non-editor tools as yet un-dreamed of.

The Operating System

While I'll generally presume that the operating system is something essentially kindred to GNU/Linux in its form and foundation, it is worth noting that, from the point of view of the generality of users, the essential parts of an operating system are (enough infrastructure to let them log in and have this be a secure process if they so desire, and) graphical tools to

communicate easily with peers (mail client, chat, instant messaging),
read what others have published (web browser) and
create and modify documents (editor, word processor, authoring tool, or what you will).

Specialist users will, naturally, have other needs: but these are the basics everyone needs (as identified in an interesting article I read early in 2003). The given functionalities may be supplied by one program, or by several that cut across the classification above (e.g. my mailer is part of my editor; many folk use mailers embedded in their web browsers); but the user must have easy access to these three kinds of functionality.

Users should be able to specify private mime.types files, in a potentially cascading chain (embracing one shared with peers and a system one, for example); likewise, it would be nice to have a generic mechanism for specifying key-bindings so that the user can use any participating application without having to find out what its particular keyboard short-cuts are, and without having to configure each application to understand the given user's key-bindings.
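
A cascading chain of mime.types files can be sketched as an ordered list of lookup tables, consulted from most private to most general, with the first table that knows an extension winning. This is a minimal sketch under stated assumptions: the parser handles only the common mime.types layout (a MIME type followed by zero or more extensions, with # comments), and the file contents and the type text/x-edyt-notes are made up for illustration; a real system would read the files from disk (a user's private file, one shared with peers, /etc/mime.types).

```python
from __future__ import annotations

from typing import Optional, Sequence


def parse_mime_types(text: str) -> dict[str, str]:
    """Parse mime.types syntax: each line is a MIME type followed by extensions."""
    table: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            table.setdefault(ext.lower(), mime_type)  # first mention in a file wins
    return table


def lookup(filename: str, chain: Sequence[dict[str, str]]) -> Optional[str]:
    """Return the type given by the first table in the chain that knows the extension."""
    if "." not in filename:
        return None
    ext = filename.rsplit(".", 1)[1].lower()
    for table in chain:
        if ext in table:
            return table[ext]
    return None


# Hypothetical contents of a private, a peer-shared and a system mime.types file.
private = parse_mime_types("text/x-edyt-notes  note\n")
shared = parse_mime_types("text/markdown  md markdown\n")
system = parse_mime_types("# system defaults\ntext/html  html htm\ntext/plain  txt note\n")
chain = [private, shared, system]

print(lookup("todo.note", chain))  # the private entry overrides the system one
print(lookup("index.htm", chain))
```

Putting the private table first is what lets a user override the system's idea of a type without editing any shared file.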

Files: octet streams, encodings, grammars, semantics

Fundamentally, following the Unix tradition, a file is a sequence of octets (a.k.a. bytes); it may be stored on disk, it may be a stream coming from a physical device (with or without storage as its source there), it may be a stream going to a physical device, it may be a sequence of states of an oscillator, it may be many things, but it is a sequence of octets. The means by which one accesses a file (or byte-stream, or whatever you want to call it) include the means to identify how those bytes are to be understood. There may be several layers to that understanding: each layer builds upon the one below, but the octet-stream is the foundation on which all else is built.

The first layer of understanding transforms the byte-stream into a sequence of tokens understood as characters. In ASCII, each byte is itself a character; since ASCII is a seven-bit code, the eighth bit is redundant and may be used as a parity bit, enabling simple detection of mis-transmission. In ISO 8859's various encodings, each byte encodes a member of a collection of characters, with each such collection typically sufficing to describe the writing system of some particular culture. For Unicode, more sophisticated encodings (such as UTF-8, which uses from one to four bytes per code point) are used to facilitate expression of every character known to any language whose users have so far taken it into their heads to ask to be included in the catalogue. For an octet stream to be useful, it needs to come with some (possibly implicit) information about how it is to be transformed into characters – this information is known as its encoding.
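
To make the point concrete: the same five octets yield different character sequences under different encodings, and are not valid at all in some. A minimal sketch using only Python's standard codecs; the helper decode_or_none is made up for the example.

```python
def decode_or_none(octets: bytes, encoding: str):
    """Decode octets as characters, or return None if they are invalid in that encoding."""
    try:
        return octets.decode(encoding)
    except UnicodeDecodeError:
        return None


octets = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])

as_utf8 = decode_or_none(octets, "utf-8")         # 'café': 0xC3 0xA9 together encode one character
as_latin1 = decode_or_none(octets, "iso-8859-1")  # 'cafÃ©': each octet is one character
as_ascii = decode_or_none(octets, "ascii")        # None: 0xC3 uses the eighth bit, outside ASCII

# UTF-8 spends one to four octets on a character, depending on its code point.
sizes = [len(ch.encode("utf-8")) for ch in ("e", "\u00e9", "\u20ac", "\U0001D11E")]
print(as_utf8, as_latin1, as_ascii, sizes)
```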

This should not be confused with encryption or compression, which may be used to package the octet-stream in some form from which the octet stream may be recovered. For my purposes, an encrypted or compressed octet stream is a separate octet-stream whose encoding, grammar and semantics are prerequisites of unpacking the encryption or compression to obtain the packaged octet-stream. Since such packaging need not be applied to an octet-stream – it can as readily be applied to a sequence of characters, lexical tokens, grammatical productions or, at least in principle, semantic atoms – it is best understood as a semantic layer whose comprehension yields the content it packages (which must then be parsed and, in its turn, comprehended).
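
As a sketch of packaging treated as its own layer, here gzip compression (chosen purely as one example of packaging) produces a separate octet stream whose own format must be understood and undone before the packaged stream's encoding can be applied. The unpack helper and its list-of-layers convention are hypothetical.

```python
from __future__ import annotations

import gzip


def unpack(octets: bytes, packaging: list[str], encoding: str) -> str:
    """Undo each packaging layer, outermost first, then decode the result as characters."""
    for layer in packaging:
        if layer == "gzip":
            octets = gzip.decompress(octets)
        else:
            raise ValueError(f"unknown packaging layer: {layer}")
    return octets.decode(encoding)


text = "Packaging wraps one octet stream in another.\n"
inner = text.encode("utf-8")  # the packaged octet stream
outer = gzip.compress(inner)  # a separate octet stream with its own format

print(outer[:2] == b"\x1f\x8b")  # gzip streams start with these two magic octets
print(unpack(outer, ["gzip"], "utf-8") == text)

# Packaging can nest: each layer is unpacked in turn before any decoding happens.
twice = gzip.compress(outer)
print(unpack(twice, ["gzip", "gzip"], "utf-8") == text)
```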

The resulting sequence of characters is then classically understood as a text in some language conforming to a grammar characterized by various patterns. A sequence of characters matching a suitable pattern gets construed into a single entity, a fragment of text characterized by the pattern. A sequence of fragments of text matching a pattern (now attending to the type of pattern characterizing each of the text fragments) gets construed as a single entity likewise; and these likewise serve as constituents in larger text fragments and so on, up to the entire document. [Contiguous sequences of letters in the text you are now reading are construed as words; punctuators group these words into clauses and the clauses into sentences; HTML mark-up groups the sentences into paragraphs; and so on.] This pattern-matching process is known as parsing; it decomposes the document into a hierarchy of sub-texts, each characterized by the pattern its constituents matched to form it, and related to one another by the patterns through which they combine.

The classic formalism of parsing goes via a layer called lexical analysis, which decomposes the sequence of characters into lexical tokens or lexemes (roughly filling, in formal grammars, the rôle of words in the grammars of European and similar languages). The resulting sequence of lexemes is then analyzed to establish its grammatical structure. While this division is often convenient for those implementing parsers, it is so mainly because the designers of the languages to be so parsed have deliberately designed languages to be amenable to some such low-level simplification; and the two phases of parsing are commonly designed and specified as a single whole. I shall, thus, treat lexical structure as part of grammar; lexical analysis as part of parsing.
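
A toy version of the bracketed example above, with lexical analysis and grammatical grouping written as two small functions: words and punctuators are the lexemes; clauses (separated by commas or semicolons) and sentences (ended by full stops, question or exclamation marks) are the productions. The grammar is a deliberately simplified assumption for illustration, not a model of English.

```python
import re

# Lexemes: words and punctuators; whitespace is recognized and discarded.
TOKEN = re.compile(r"(?P<word>[A-Za-z']+)|(?P<punct>[,;.!?])|(?P<space>\s+)")


def lex(text: str) -> list:
    """Lexical analysis: turn characters into (kind, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens: list) -> list:
    """Grouping: words into clauses (split at , or ;), clauses into sentences (ended by . ! ?)."""
    document, sentence, clause = [], [], []
    for kind, lexeme in tokens:
        if kind == "word":
            clause.append(lexeme)
        elif lexeme in ",;":
            sentence.append(clause)
            clause = []
        else:  # a sentence terminator
            sentence.append(clause)
            document.append(sentence)
            sentence, clause = [], []
    if clause or sentence:  # tolerate a missing final terminator
        sentence.append(clause)
        document.append(sentence)
    return document


print(parse(lex("Words form clauses; clauses form sentences. Done!")))
```

Keeping the two phases as separate functions mirrors the classic formalism, though, as the text notes, they are really one grammar.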

Thus a file, interpreted via an encoding as a character sequence, needs an associated grammar to characterize the structure of the character sequence. The grammar is standardly communicated via a MIME type; for some MIME types, the grammar is enriched by a further specification of patterns to be matched, refining the hierarchy implied by the MIME type. Thus an XML document may also have a schema or a DTD to specify more detail about its grammar: matching the patterns of the plain XML grammar is described as well-formedness, but matching the DTD is a stronger condition, called validity, and the DTD thereby conveys richer information. For many MIME types (specifically, those which do imply a hierarchical grammar) it should be possible to specify a reversible transformation between any valid document of the given type and an XML document conforming to some DTD associated with the type; consequently, I generally treat all grammars as reducible to XML and a suitable DTD.
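
As an illustration of such a reversible transformation, here is a minimal sketch for text/csv, mapping each record to a row element and each field to a cell element. The element names and the DTD in the comment are assumptions made up for this example. The round trip is exact only for CSV written in the canonical form Python's csv module produces (minimal quoting, newline line ends). ElementTree checks well-formedness when parsing but does not validate against a DTD.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical DTD for the XML side of the mapping:
#   <!ELEMENT table (row*)>
#   <!ELEMENT row (cell*)>
#   <!ELEMENT cell (#PCDATA)>


def csv_to_xml(text: str) -> str:
    """Map each CSV record to <row> and each field to <cell>."""
    root = ET.Element("table")
    for record in csv.reader(io.StringIO(text)):
        row = ET.SubElement(root, "row")
        for field in record:
            ET.SubElement(row, "cell").text = field
    return ET.tostring(root, encoding="unicode")


def xml_to_csv(xml_text: str) -> str:
    """Inverse mapping: rebuild CSV records from <row>/<cell> elements."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in root.findall("row"):
        # An empty field serializes as <cell />, which parses back with text None.
        writer.writerow([cell.text or "" for cell in row.findall("cell")])
    return out.getvalue()


sample = 'name,qty\nwidget,3\n"a, b",\n'
print(csv_to_xml(sample))
print(xml_to_csv(csv_to_xml(sample)) == sample)
```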

I'm here mainly concerned with text documents: for an audio or video stream, there are analogous stages of processing on the way from an octet stream to some representation amenable to play-back or display. That may involve an intermediate format, such as a family of MIDI streams for audio, that might be more amenable to editing. I'm not primarily thinking about such data types, though, so I shall leave the audio-visual aficionados to think about how much, if any, of my analysis is applicable to their work. Still, these days, even text documents tend to come with embedded images or audio-visual material, so I cannot ignore this entirely.

Finally, though usually integrated with the process of parsing, there may be some layer of making sense of the structured text obtained by parsing the sequence of characters – this will generally be some form of processing or analysis of the data taking into account its meaning. (Note that only at this layer is there information, separated from the raw data by a MIME type, an encoding and whatever makes sense of the result of parsing.) As noted above, the sense made of the result of parsing may, itself, merely be to interpret it as the result of compressing or encrypting a further sequence of octets: in this case, the system making sense of the parsed data may simply transform it into a fresh byte-stream (or character stream, or even a parse tree). In any case, the thing which makes sense of the result of parsing may reasonably be presumed to always be a program (though a family of inter-operating programs may all interpret the parsed data the same way) interpreting the document in a manner typically specified with the MIME type, DTD or similar document; this is known as the semantics of the document.
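
A sketch of that arrangement: interpreters are registered against MIME types, and an interpreter may either return a final reading or hand back a fresh (MIME type, octets) pair to be understood in its turn, which is how packaging such as compression fits in. The registry, the interprets decorator and the word-counting "semantics" for text/plain are hypothetical stand-ins; decoding and parsing are folded into each interpreter for brevity, and the gzip interpreter simply assumes its packaged content is text/plain, where a real system would have to learn the inner type from elsewhere.

```python
import gzip
from typing import Callable, Dict, Tuple, Union

# An interpreter returns either a final reading (str) or a fresh
# (MIME type, octets) pair that must itself be understood.
Reading = Union[str, Tuple[str, bytes]]
Interpreter = Callable[[bytes], Reading]

INTERPRETERS: Dict[str, Interpreter] = {}


def interprets(mime_type: str) -> Callable[[Interpreter], Interpreter]:
    """Register the decorated function as the semantics for mime_type."""
    def register(func: Interpreter) -> Interpreter:
        INTERPRETERS[mime_type] = func
        return func
    return register


@interprets("text/plain")
def count_words(octets: bytes) -> Reading:
    return f"{len(octets.decode('utf-8').split())} words"


@interprets("application/gzip")
def gunzip(octets: bytes) -> Reading:
    # Assumption: the packaged content is text/plain.
    return ("text/plain", gzip.decompress(octets))


def understand(mime_type: str, octets: bytes, max_layers: int = 8) -> str:
    """Apply interpreters until one yields a final reading."""
    for _ in range(max_layers):
        reading = INTERPRETERS[mime_type](octets)
        if isinstance(reading, str):
            return reading
        mime_type, octets = reading
    raise ValueError("too many nested layers of packaging")


print(understand("application/gzip", gzip.compress(b"three little words")))
```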

The internet client daemon

The file-system should provide a service that presents the internet as part of itself: a virtual disk managed by a daemon which talks the relevant protocols (as client), handles caching, transforms file opening into requests in the suitable protocol, etc. This should not be a piece of functionality of any particular application (editor, web browser, helper application launched by either) but a service shared by them all.

Its handling of caching issues requires that applications have some simple way of telling it when they take an interest in a file and when they lose that interest (rather than the web browser launching a helper application, then either keeping the cached document for ever, or flushing it from cache before the user has finished viewing it).
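
One way to give applications that simple means is a pair of calls, one to declare interest and one to withdraw it, with the daemon treating declared interest as a pin that protects a cached resource from eviction. This is a hypothetical sketch: the class and method names are made up, fetch stands in for whatever protocol client the daemon uses, and capacity is counted in resources rather than bytes for simplicity.

```python
from typing import Callable, Dict


class ClientDaemonCache:
    """Cache entries stay pinned while any application holds an interest in them."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int = 2):
        self.fetch = fetch
        self.capacity = capacity
        self.content: Dict[str, bytes] = {}  # insertion order: oldest first
        self.interest: Dict[str, int] = {}   # url -> number of interested applications

    def acquire(self, url: str) -> bytes:
        """An application declares interest in url; fetch it if not already cached."""
        if url not in self.content:
            self.content[url] = self.fetch(url)
        self.interest[url] = self.interest.get(url, 0) + 1
        self._evict()
        return self.content[url]

    def release(self, url: str) -> None:
        """An application has finished with url."""
        self.interest[url] -= 1
        if self.interest[url] == 0:
            del self.interest[url]
        self._evict()

    def _evict(self) -> None:
        # Only entries nobody is interested in may be dropped, oldest first;
        # pinned entries stay even if that leaves the cache over capacity.
        idle = [url for url in self.content if url not in self.interest]
        while len(self.content) > self.capacity and idle:
            del self.content[idle.pop(0)]
```

A helper application launched by the browser would call acquire when it starts and release when the user closes it, so the document neither lingers for ever nor vanishes mid-view.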

The daemon will need to understand, if using a local proxy, which resources it can rely on the proxy to provide (e.g. when several colleagues read some cartoon daily, some shared machine should cache it in a proxy, rather than each of them caching the image locally; but when I book an airline ticket, my local daemon should record the response I got from the server and not request it again unless I actually click on the relevant button of the page that accessed it). Note that this daemon is not a proxy in the terms of the W3C specs: it is (in those terms) part of the user agent. As such, it needs to be able to distinguish between a request to re-fetch and an application merely wanting to re-display what was last fetched; yet, it probably should also know about caching aspects of the protocols, so that it will transparently fetch any resource for which that is appropriate, even without being asked to re-fetch. It also needs to be capable of recognising that POST, PUT and DELETE actions on a resource may invalidate any cached value for it; in particular, user applications should let it know when they take such actions.
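
The distinction between re-displaying and re-fetching, together with invalidation by unsafe methods, can be sketched as follows. Assumptions: fetch is a hypothetical protocol client returning the body together with a freshness lifetime in seconds (in HTTP this would come from headers such as Cache-Control: max-age), or None when the daemon should keep showing what it has until asked to re-fetch; the clock is injectable so the behaviour can be exercised without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple

UNSAFE_METHODS = {"POST", "PUT", "DELETE"}

# Hypothetical protocol client: url -> (body, freshness lifetime in seconds or None).
Fetcher = Callable[[str], Tuple[bytes, Optional[float]]]


class ClientDaemon:
    def __init__(self, fetch: Fetcher, clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch
        self.clock = clock
        # url -> (body, expiry time); an expiry of None means "never stale on its own".
        self.held: Dict[str, Tuple[bytes, Optional[float]]] = {}

    def _get(self, url: str) -> bytes:
        body, lifetime = self.fetch(url)
        expiry = None if lifetime is None else self.clock() + lifetime
        self.held[url] = (body, expiry)
        return body

    def display(self, url: str) -> bytes:
        """Re-display what was last fetched, fetching only if nothing usable is held."""
        if url in self.held:
            body, expiry = self.held[url]
            if expiry is None or self.clock() < expiry:
                return body
        return self._get(url)  # nothing held, or the protocol says it has gone stale

    def refetch(self, url: str) -> bytes:
        """The user explicitly asked for a fresh copy."""
        return self._get(url)

    def notify(self, method: str, url: str) -> None:
        """Applications report requests they make; unsafe ones invalidate the held copy."""
        if method.upper() in UNSAFE_METHODS:
            self.held.pop(url, None)
```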

The editor

The document exists, within the editor, in two forms: the buffer contains the byte-stream (and may well carry much of a parse tree with it); this may be displayed via several portals (each of which may, in turn, show up in several windows, to which I'll return). Each portal has a way of understanding the buffer – for example, if the buffer's contents are HTML encoded in Unicode, one can display the raw bytes, one can display the text they encode (but still show the HTML tags and so on) or one can display the web page expressed by the HTML document. There may also be developer-mode portals, such as a view of an HTML document in which selecting a particular rule in an associated style sheet highlights the elements affected by that rule, or a view of source code in which activity in a debugger session (in its own portal) highlights the currently active code and permits interaction with variables in the code to display their current values.

Each portal has its own history of significant positions recently visited (in emacs parlance: point-and-mark chain); and each may be read-only or mutable, independently of other portals on the same document (though having only one mutable portal may be prudent). All portals on a document share one copy of the document, however; one operating-system resource (file, whatever) might be opened as several documents, though this will tend to create complications and won't be the normal policy. Auto-save policy and strategy, if any, will be a property of the document (not the portal, nor the OS resource).
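
The relationships in the last two paragraphs can be sketched in Python: one Document object holds the shared octets and the auto-save policy, while each portal holds its own way of rendering them, its own mark history and its own read-only flag. All names here are hypothetical; positions are octet offsets for simplicity, and only raw-octet and decoded-text portals are shown (a rendered-HTML portal would need a layout engine).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    """The single shared copy of the content; auto-save policy lives here."""
    octets: bytearray
    encoding: str = "utf-8"
    autosave_seconds: Optional[float] = None  # None: no auto-save


class Portal:
    """A way of understanding a document, with its own marks and mutability."""

    def __init__(self, document: Document, read_only: bool = False):
        self.document = document
        self.read_only = read_only
        self.marks: List[int] = []  # recently visited positions, most recent last

    def push_mark(self, position: int) -> None:
        self.marks.append(position)

    def insert(self, position: int, data: bytes) -> None:
        if self.read_only:
            raise PermissionError("this portal is read-only")
        self.document.octets[position:position] = data  # every portal sees the change

    def render(self) -> str:
        raise NotImplementedError


class RawPortal(Portal):
    def render(self) -> str:
        return self.document.octets.hex(" ")  # octets as space-separated hex


class TextPortal(Portal):
    def render(self) -> str:
        return self.document.octets.decode(self.document.encoding)


doc = Document(bytearray(b"<p>hi</p>"), autosave_seconds=30.0)
raw = RawPortal(doc, read_only=True)
text = TextPortal(doc)
text.insert(3, b"oh, ")
text.push_mark(3)
print(text.render())  # edited through the mutable text portal
print(raw.render())   # the read-only raw portal shows the same, changed octets
```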

One portal may be displayed in several windows at a single time – e.g. so that one can scroll to different positions in the portal to view them side-by-side. Indentation and margins are portal properties, although the saved document may have its own properties describing this at each of the various levels of processing from the byte-stream upwards. Paragraph re-flow policy will be a property of the portal; for read-only portals this is reflow-to-display, but a mutable portal may have a (potentially separate) policy for where the saved form of the document shall contain line-breaks and how it is indented. Long line truncation-or-reflow policy will be a property of the window, but the portal may provide default settings for this on window-creation. For an HTML document, in a text-level portal showing the HTML tags, the indentation and line-break policies are matters of actual line-breaks and spacing at the start of each line. For an HTML-view of the same document, they'll be CSS properties, with all the complications that implies for how they are expressed via CSS (on a local element or on all elements matching a given selector; in the HTML file, or in one of the cascade of imported CSS resources).
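
A sketch of the window-level policy: each window keeps its own width, its own scroll position and its own truncate-or-reflow choice for long lines, taking the portal's default when none is given. The classes are hypothetical, the portal is reduced to a piece of text, and the reflow uses Python's textwrap.wrap, a simple greedy line-filler standing in for whatever reflow policy a real portal would have.

```python
import textwrap
from typing import List, Optional


class Portal:
    """Hypothetical portal carrying display defaults for new windows."""

    default_long_lines = "reflow"  # or "truncate"

    def __init__(self, text: str):
        self.text = text


class Window:
    """One of possibly several views on the same portal."""

    def __init__(self, portal: Portal, width: int, long_lines: Optional[str] = None):
        self.portal = portal
        self.width = width
        self.top = 0  # first displayed line: scrolling is per window
        # The window owns the truncate-or-reflow policy; the portal only supplies the default.
        self.long_lines = long_lines or portal.default_long_lines

    def lines(self) -> List[str]:
        out: List[str] = []
        for line in self.portal.text.splitlines():
            if self.long_lines == "reflow":
                out.extend(textwrap.wrap(line, self.width) or [""])  # keep blank lines
            else:
                out.append(line[: self.width])
        return out[self.top:]


portal = Portal("one two three four\n\nfive")
left, right = Window(portal, 8), Window(portal, 8, "truncate")
print(left.lines())
print(right.lines())
```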

Most of the important sophistication of the editor will focus on portals. As emacs has editing modes, so edyt must have classes of portal. I believe the kindest way to let users re-configure portals to match user preferences is to use python (or something very similar) as the configuration language, with each type of portal relating to a particular class derived from a portal base-class. Customizing a portal-type would then be achieved by sub-classing an existing portal-type and having the sub-class tweak relevant attributes and behaviours.
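
A minimal sketch of what such configuration might look like, taking the two sgml-mode enhancements from the opening rant as the test case: the base portal-type exposes its tag-insertion behaviour through plain class attributes, and a user's sub-class overrides just those attributes. The class names, attributes and methods are hypothetical, invented for illustration; nothing about edyt's actual interfaces is implied.

```python
class HtmlSourcePortal:
    """Hypothetical base portal-type for editing HTML source as text."""

    attributes: dict = {}      # tag name -> attributes to add when inserting that tag
    break_after = frozenset()  # tag names whose inserted element is followed by a newline

    def element(self, name: str, content: str = "") -> str:
        attrs = "".join(f' {key}="{value}"' for key, value in self.attributes.get(name, {}).items())
        text = f"<{name}{attrs}>{content}</{name}>"
        return text + "\n" if name in self.break_after else text

    def list_element(self, kind: str, items) -> str:
        """Insert an OL or UL with one LI per item."""
        return self.element(kind, "".join(self.element("li", item) for item in items))


class MyHtmlPortal(HtmlSourcePortal):
    """A user's customization: tweak two attributes, inherit all behaviour."""

    attributes = {**HtmlSourcePortal.attributes, "html": {"lang": "en"}}
    break_after = HtmlSourcePortal.break_after | {"li"}


print(HtmlSourcePortal().list_element("ul", ["a", "b"]))
print(MyHtmlPortal().list_element("ul", ["a", "b"]))
print(MyHtmlPortal().element("html"))
```

The design choice is that customization overrides data rather than rewriting methods, so a user need not understand how insertion is implemented to change what it inserts.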

Written by Eddy.