A Heinous Hack

Sometimes, in software, one must do something rather strange to solve a problem. Ingenious solutions to problems tend to get called hacks; and they tend to be heinous in one way or another – which is part of their charm. (One of the finest examples of the genre is Duff's device; the perpetrator's ambivalence about it is a standard feature of any truly heinous hack.) So here is an account of one such hack I perpetrated, to assist my friends in the Qt project's documentation team.

We have a tool called QDoc which scans source code to generate documentation; in so doing, it needs to make enough sense of the software and the comments embedded in it to work out what data types and routines make up the public interface and to associate each with a comment that describes it, warning us of any mismatch (public interface that isn't documented, or documentation for a public interface that it can't find). In order to do that it needs to know where to find standard header files that declare system-provided facilities used by programs in the language (C++) of the code. We'd recently reworked QDoc to have clang make sense of the code (saving us the maintenance of a parser for the language), with some code of ours hooked into clang to do the job we want. This all went well until we found that clang needed (back in the old version we were then using) to be told where to find some system header files; and we couldn't (entirely) rely on a fixed location for that. A frustrated colleague turned to me for suggestions and I gave it some thought.

C++ is based on an older language called C and shares, with it, a preprocessor that lets one embed, in code, directives that get replaced in predictable ways; this allows one to (among other things) save repeating the same code in many places. One of these directives is #include, used (usually at the top of a file) to refer to header files that declare facilities used by the code in the file. Another is #define, which specifies a word (technically, an identifier; it is composed of letters, digits and underscores and need not appear in any dictionary of any language any folk speak) and a text to be used to replace that word (optionally using, in the replacement, some texts in parentheses after the word) wherever it appears. When such replacement is performed, if the word being replaced appears in the replacement text, it is (crucially) not replaced again; uses of the word in the replacement text remain in the text finally seen by the program that makes sense of the source code – ordinarily, to turn it into the machine instructions that make up the actual executable you'll run; but, in QDoc's case, to generate documentation from it. The program that does this is known as a compiler (which strictly produces object code; another program, the linker, combines a bunch of that into the final executable file of machine instructions). In particular, there are certain special words that the compiler implicitly defines (as if with a #define directive), with well-defined standard meanings. One of these is __FILE__, which expands to (a string literal encoding) the path-name (i.e. the full statement of where the file is on the system, along with its name) of the file in whose text the compiler gets to replace this word.
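
To make those two rules concrete, here is a small sketch of my own (it is not part of QDoc; the names twice, wrapped and this_file are made up for illustration). It shows a word surviving inside its own replacement text, and __FILE__ turning into a string that names the file being compiled.

```cpp
#include <cassert>

// A plain function, standing in for one that some header declares.
int twice(int n) { return 2 * n; }

// From here on, the word "twice" is replaced by "1 + twice".  The "twice"
// inside the replacement text is not replaced again, so it still names
// the function defined above.
#define twice 1 + twice

// The compiler sees the body as: return 1 + twice(n);
int wrapped(int n) { return twice(n); }

#undef twice

// __FILE__ is replaced by a string literal naming the file in which the
// compiler met the word; what that string looks like depends on how the
// compiler was invoked.
const char *this_file() { return __FILE__; }

// A self-check of the claims in the comments above.
void check_macro_demo() {
    assert(wrapped(3) == 7);          // 1 + twice(3), not an endless expansion
    assert(this_file()[0] != '\0');   // some non-empty file name
}
```

Had the preprocessor replaced the word again inside its own replacement, the #define above would never finish expanding; because it doesn't, the hack described below is possible.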

Now, QDoc itself is a program written in C++, so is compiled to produce an executable program to be run, that in turn compiles other source code to generate documentation (rather than an executable program). The program that compiles QDoc does know where the system headers are; so I just needed to arrange for QDoc to sneak that information off it when it got compiled. My devious idea was to arrange for the source of QDoc to define a carefully chosen word – the name of a function provided by a system header file in the location we needed to know – to expand to some text that included the word __FILE__ so that the compiler, when compiling QDoc, would see __FILE__ in that header and expand it to the path-name of that header, from which I could then extract (with some fairly simple code) the directory name we needed. The core of the hack looks like this:

#define setlocale locale_file_name_for_clang_qdoc() { \
        static char data[] = __FILE__;           \
        return data;                             \
    }                                            \
    char *setlocale
#include <locale.h>
#undef setlocale

It's relying on the fact that the header file locale.h contains a declaration that looks at least somewhat like this (it is the POSIX standard's form of the declaration, so actual implementations have to look enough like it to work the way it's specified to):

char *setlocale(int category, const char *locale);

which, thanks to my #define, will actually get read by the compiler as (give or take some spaces being arranged differently, in ways that the compiler ignores)

char *locale_file_name_for_clang_qdoc() {
    static char data[] = __FILE__;
    return data;
}
char *setlocale(int category, const char *locale);

So the original declaration of setlocale remains as it was (because we don't re-replace the word when it appears in its own replacement) and I've inserted a definition of an (inline) function into the header file; since the compiler sees this as part of the text of the header file, it replaces the word __FILE__ with the path-name of locale.h; my little function then returns that path-name to its caller. The other part of my code calls this function, locale_file_name_for_clang_qdoc(), and can duly find the last directory separator (which is followed by locale.h, the file's name within the directory) and know that the part before that is the directory name we needed.
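
That other part, pulling the directory out of the path-name, might look something like the following sketch. The name directory_of is hypothetical and this is not QDoc's actual code; it assumes the path uses either / or \ as its directory separator.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: given the path-name of a file, return the name of
// the directory that holds it, by cutting at the last directory separator.
std::string directory_of(const std::string &path) {
    const std::string::size_type cut = path.find_last_of("/\\");
    if (cut == std::string::npos)
        return std::string();       // bare file name, no directory part
    if (cut == 0)
        return path.substr(0, 1);   // file sits directly in the root directory
    return path.substr(0, cut);     // everything before the last separator
}

// A self-check on a typical Unix-style path.
void check_directory_of() {
    assert(directory_of("/usr/include/locale.h") == "/usr/include");
}
```

In the hack itself, the string handed to such a helper would be whatever locale_file_name_for_clang_qdoc() returns.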

We initially tried this with two other functions, declared in a different header file, but the trick relies on the function only ever being declared in its header and never used or even redeclared elsewhere in the header; unfortunately, the first two functions we tried got an extra mention, aside from the one we needed. So we needed a singly-declared (and never used) function; and it had, furthermore, to return a pointer to character data (the char * return type above; although const char * would have done, too). It took some trial-and-error before one of my colleagues found a function that satisfied these criteria on all the systems where we needed it to work.

We later had to suppress this on Microsoft's systems (Microsoft's compiler, MSVC, had quite understandable reservations about defining a function in a context that's really meant just to declare functions) and, later still, we were able to retire it when a new version of clang turned out to know how to find those headers for itself (which it might have been able to do before, for that matter; we may just have failed to select the right options to specify for it). It remains a heinous hack of which I'm mildly proud, all the same.


Written by Eddy.