Compiler Ancestry

While compiling the compiler the other day on this very server, I had some time to think about compilation. The server's running Gentoo Linux, a distribution with a BSD-ports-style package system that exercises the compiler more than most; all software packages and updates are delivered as source code.

For the uninitiated (you lucky rascals), a compiler is a computer program that translates human-writable commands (a.k.a. "source code") into computer-readable instructions (a.k.a. "machine language"). Source code is written in a variety of programming languages, most of which look like over-punctuated algebra for dimwits. Machine language is nothing but strings of ones and zeros, intelligible only to the central circuitry of the computer and spectacled and suspender'd 70s-era code jockeys who fancy this skill impressive.

As with any computer program, compilers themselves must be compiled into machine language before the computer can run their instructions. It's a little bit chicken 'n egg: you need a compiler to build a compiler.

Of course, just as we have chickens, we have compilers, so this is somehow possible. An earlier version of the compiler is used to compile the source code for the most recent version (which has new features, performance enhancements, bug fixes, etc.). There's always a second step though: the newly compiled version must then compile its own source code once again -- so that the finished compiler is itself built with all of its own improvements.
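For the code-minded, here's a toy model of that two-step dance. It's a rough sketch, not how any real build system works: the `Binary` record, the `compile_source` function, and the version numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binary:
    behaves_as: str  # version of the source code this binary was built from
    codegen_by: str  # version of the compiler that generated its machine code


def compile_source(compiler: Binary, source_version: str) -> Binary:
    # The result behaves like the source it was built from, but its machine
    # code was generated by whatever compiler did the building.
    return Binary(behaves_as=source_version, codegen_by=compiler.behaves_as)


def bootstrap(old_compiler: Binary, new_source: str) -> tuple:
    # Step 1: the old compiler builds the new source.
    stage1 = compile_source(old_compiler, new_source)
    # Step 2: the new compiler rebuilds itself from the same source.
    stage2 = compile_source(stage1, new_source)
    return stage1, stage2


# Hypothetical version numbers, chosen only to make the stages visible.
old = Binary(behaves_as="3.3", codegen_by="3.2")
stage1, stage2 = bootstrap(old, "3.4")

print(stage1)  # new features, but machine code generated by the old compiler
print(stage2)  # new features, and machine code generated by the new compiler
```

After the first step the compiler knows all the new tricks but was itself put together by the old one; only after the second step has it been built with its own improvements.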

And if that weren't enough, compilers can also jump across different computer architectures -- from a Sun UltraSPARC to an AMD Opteron, for example -- through a technique called "cross compilation". In this case a compiler runs on one architecture and emits the machine language instructions for a different target platform. The freshly compiled compiler is copied over to the target machine, where it can then be used to recompile its own source code all over again on the target architecture. In analogy-speak, our chicken lays a turkey egg.
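The chicken-to-turkey hop can be sketched the same way. Again, this is only a toy under loud assumptions: the `Compiler` record, the `build_compiler` function, and the architecture labels are invented for illustration and don't correspond to any real tool's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compiler:
    runs_on: str  # architecture whose machine language the compiler itself is in
    targets: str  # architecture it emits machine language for


def build_compiler(builder: Compiler, targets: str) -> Compiler:
    # Whatever the builder compiles comes out in the builder's target machine
    # language, so that is where the new compiler will run. What the new
    # compiler emits is a separate choice made at build time.
    return Compiler(runs_on=builder.targets, targets=targets)


# Start with an ordinary native compiler on the SPARC machine.
native_sparc = Compiler(runs_on="sparc", targets="sparc")

# Build a cross compiler: it still runs on SPARC but emits Opteron code.
cross = build_compiler(native_sparc, targets="x86_64")

# Use the cross compiler to build a compiler that runs on the Opteron;
# this is the binary that gets copied over to the target machine.
native_opteron = build_compiler(cross, targets="x86_64")

print(cross)
print(native_opteron)
```

The last compiler in the chain runs on the Opteron and emits Opteron code, even though no Opteron was involved in building it.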

The upshot of all this is that each compiler is built from its ancestor, which is built from its ancestor, and so on. It almost seems possible to trace the chain all the way back to the Adam Compiler.

For instance, I'm running GCC, a compiler for the C programming language, version 3.4.6-r1 on the server. There is a history of GCC releases that traces back to version 0.9 in 1987. Now that's where the GCC lineage ends, because the first versions were written by Richard Stallman, ostensibly from scratch since his GNU Project abides nothing but free (as in speech) software, and there were no free C compilers at that time. Bootstrapping a new compiler out of thin air is a "non-trivial" task.

What chain of versions leads back from my compiler to Stallman's initial release 19 years ago? Which people ran make on which computers with which versions of GCC? It must've been a pretty tortuous path from the halls of MIT to this Yukon yokel.
