Compilers
Writing a compiler
Go 1.5 shipped in 2015 with one notable fact: the Go compiler is written in Go and compiles itself. That is not just a technical achievement, it is a philosophical moment: the language reached self-sufficiency. Writing a compiler that compiles itself is the peak of compiler engineering, reached by Go, Rust, TypeScript, and GCC. This lesson is about how to get there.
- **TypeScript tsc**: the TypeScript compiler is written in TypeScript and type-checked by itself. Adding a new typing feature requires the compiler's own code to use it correctly
- **Go bootstrap in 45 minutes**: `GOROOT_BOOTSTRAP=/usr/local/go1.4 go build` reproduces the full bootstrap process. It runs regularly in CI to verify the compiler
- **csmith differential testing**: automatically found 300+ bugs in GCC, Clang, MSVC, and ICC by generating random C programs and comparing outputs. No tests written by hand
End-to-end compilation
A compiler is a pipeline: source code -> lexer -> tokens -> parser -> AST -> semantic analysis -> IR -> optimizations -> code generation -> machine code. Each stage transforms the representation by adding information (types, scopes, registers). An error at any stage should produce precise diagnostics.
The Tiny C Compiler (TCC) is a single-pass C compiler in around 15,000 lines: lexer, parser, and code generation all in one pass without an AST. It compiles 10x faster than GCC but the output runs slower. That illustrates the trade-off: separation of concerns (several passes) vs compilation speed (a single pass).
Why do most modern compilers use multiple passes instead of one?
Language design decisions
Language design decisions determine the complexity of the compiler. Static vs dynamic typing: static enables better codegen and IDE support, dynamic offers flexibility. Garbage collected vs manual: GC is easier for the user, manual provides predictable latency. Whitespace-significant (Python) vs braces (C): affects only the lexer. Expression-oriented vs statement-oriented affects AST design.
The Zig language (Andrew Kelley) has comptime: compile-time computation in the same language as runtime. That simplifies the compiler: one interpreter for both phases. The TypeScript type system is Turing-complete (proved in 2021), which means some type checking tasks are algorithmically undecidable. The compiler adds a recursion depth limit as a safeguard.
Why is adding generics to a language a non-trivial compiler task?
Compiler testing
Compiler testing is harder than testing ordinary software: you have to verify codegen correctness, language semantics, parser edge cases, and performance on large files. Approaches: unit tests for individual phases, snapshot tests (golden files), fuzzing, differential testing against another implementation.
TypeScript has around 60,000 conformance tests. The Rust test suite runs about 30 minutes on CI. The GCC test suite (dejagnu) holds 100,000+ tests built up over 35 years. Fuzzing (libFuzzer + afl++) finds parser crashes. In 2021, 200+ bugs were found in LLVM through continuous fuzzing in OSS-Fuzz.
What is differential testing for compilers?
Bootstrapping: the compiler compiles itself
Bootstrapping is the process of writing a compiler in the same language it compiles. The start: write a compiler for language X in language Y (usually C). Then rewrite it in X. Build it using the Y-based compiler. Then build itself with its own compiler. Rust, Go, GCC, and TypeScript are all bootstrapped.
The TypeScript compiler (tsc) is written in TypeScript and compiled by tsc. That means the source of the TypeScript compiler is valid TypeScript that type-checks. When TypeScript adds new typing features, the team writes them in JavaScript first and then migrates to the new TypeScript feature. GHC (the Haskell compiler) is one of the most complex bootstrapped compilers: it takes 40+ minutes to build.
Bootstrapping is impossible without another language. You need a 'first' compiler in assembly
Bootstrapping just needs a starting point. The first compiler can be a minimal interpreter or written in another language. From there the compiler extends itself
Go 1.0 was written in C. That was the starting point. Go 1.5 replaced the C implementation with a Go one compiled by Go 1.4. Rust was written in OCaml (rustboot) before the first Rust-on-Rust bootstrap. The history of every language is a chain of bootstrap steps
What is 'stage2' in the Rust bootstrapping process?
Key ideas
- A compiler is a pipeline of independent phases: lexer -> parser -> semantic -> IR -> optimizations -> codegen. Each phase adds information
- Language design decisions (typing, evaluation strategy, generics) directly determine the complexity of each compiler phase
- Bootstrapping: the compiler compiles itself. The stage2 check verifies codegen correctness through idempotence
Related topics
A compiler ties together every compiler concept:
- Writing an interpreter — An interpreter is the simpler first step. A compiler adds code generation on top
- LLVM infrastructure — Most modern languages (Rust, Swift, Julia) use LLVM as the compiler backend
- Compiler testing — Compilers require specific testing methodologies: fuzzing, differential testing, conformance tests
Вопросы для размышления
- Thompson's Trust attack (Ken Thompson, 1984) shows that a bootstrapped compiler can hide a backdoor invisible in the source. How do reproducible builds help defend against it?
- TypeScript adds new typing features in tsc, written in TypeScript. How does the team develop a feature before the compiler supports it?
- Rust's stage2 check ensures two passes produce identical binaries. What can cause a stage2 mismatch, and what does it say about the compiler's correctness?