Compilers
How a Computer Reads Code
John Backus and the First Real Compiler
In 1957, John Backus's team at IBM shipped the first FORTRAN compiler. Before that, every program was hand-written in assembly. Colleagues skeptically called the project 'word games' - they did not believe a machine could generate code as well as a human. The FORTRAN compiler proved them wrong: the generated code was nearly as efficient as handwritten assembly. Backus also invented BNF notation (Backus-Naur Form) - the standard way of writing programming language grammars.
Цели урока
- Understand that source code is a stream of bytes and how encoding determines character mapping
- Explain the lexer's role: grouping characters into tokens while tracking position
- Distinguish Parse Tree from AST, build an AST for arithmetic expressions
- Understand how codegen translates AST into machine instructions and the role of the linker
Every Python line, every React component, every Rust backend request - all pass through the same pipeline. LLVM compiles 60+ languages. V8 JIT-compiles JavaScript into native code 100x faster than interpretation. Behind all that variety sits one architecture: characters -> tokens -> AST -> IR -> machine code.
- **V8 (Google Chrome, Node.js)** - JIT-compiles JS through AST -> Ignition bytecode -> TurboFan machine code. The same pipeline as GCC
- **LLVM** - backend that compiles Rust, Swift, C++, Julia, Kotlin/Native. 60+ frontend languages, one IR
- **TypeScript compiler (tsc)** - complete pipeline: lexer -> parser -> type checker -> emitter. 100K+ lines of TypeScript
- **Python bytecode** - CPython compiles to .pyc (bytecode) executed by the VM. `dis.dis(func)` shows the result
- **ESLint/Prettier/Babel** - all operate on AST transformations. An ESLint plugin is a visitor over AST nodes
Предварительные знания
Source Code as a Stream of Characters
**LLVM compiles Rust, C++, Swift, Julia, and 60+ other languages.** Different syntaxes, different semantics - one pipeline. Because at the input, all these languages look identical: a stream of bytes. `x = 42 + y` for the compiler is `78 20 3D 20 34 32 20 2B 20 79`. No variables, no operators - just numbers. The job of the first phase is to reconstruct **structure and meaning** from that raw stream.
**Encoding** determines how characters map to bytes. ASCII uses 1 byte (128 characters). UTF-8 uses 1-4 bytes (the full Unicode range). Python 3, Rust, Go, and Swift all use UTF-8. GCC supports files in other encodings via `-finput-charset`.
| Encoding | Character size | Range | Example |
|---|---|---|---|
| ASCII | 1 byte | 128 characters | A-Z, a-z, 0-9, +, =, ; |
| Latin-1 | 1 byte | 256 characters | ASCII + accented characters |
| UTF-8 | 1-4 bytes | 1,112,064 characters | Everything: CJK, emoji, Cyrillic |
| UTF-16 | 2-4 bytes | 1,112,064 characters | Used internally by Java and JS |
The compiler reads a file **sequentially**, one character at a time, left to right, top to bottom. It tracks the current **position** (line:column) for error messages. When seeing `error at line 42, column 15` - that is the position tracker inside the lexer doing its job.
The BOM (Byte Order Mark) is the invisible character `U+FEFF` that some editors (Notepad on Windows) silently prepend to UTF-8 files. Many mysterious 'the code looks correct but won't compile' bugs are caused by invisible Unicode characters.
Why does the compiler need to know the position (line:column) of each character?
Tokens: the Atoms of a Language
**Lexical analysis (lexing)** is the compiler's first phase. The lexer groups characters into **tokens** - the smallest meaningful units of the language. Clang and GCC hand-write the lexer using switch/case: regex gives the right result, but production compilers need throughput to process millions of lines per second. Tree-sitter, used by VS Code for code navigation, is an incremental lexer - when a single line changes, only the modified fragment gets re-tokenized.
| Category | Examples | Regex pattern |
|---|---|---|
| Keyword | if, else, while, return | Fixed list |
| Identifier | x, myVar, calculate | [a-zA-Z_][a-zA-Z0-9_]* |
| Number | 42, 3.14, 0xFF | [0-9]+(\\.[0-9]+)? |
| String | "hello", 'world' | "[^"]*" |
| Operator | +, -, *, ==, != | Fixed list |
| Delimiter | (, ), {, }, ; | Single characters |
| Whitespace | space, tab, newline | Usually discarded |
**Rule order matters!** The keyword `if` must be checked *before* the general identifier pattern `[a-zA-Z_]\w*`, otherwise `if` will be recognized as an ordinary identifier. Use `\b` (word boundary) for keywords so that `iffy` is not tokenized as `if` + `fy`.
How many tokens will the lexer produce for the expression `if (x >= 10)` ?
AST: the Expression Tree
A list of tokens is flat. It does not capture **structure**: the fact that `*` executes before `+`, or that `if` controls a block. The **parser** builds an **AST** (Abstract Syntax Tree) from tokens - a tree that reflects the hierarchy of expressions. ESLint, Prettier, Babel, and the TypeScript compiler all operate on AST transformations. Rust proc-macros receive an AST as input and return an AST as output.
**Operator precedence** is encoded in the tree's structure: `*` sits deeper than `+`, so it is evaluated first (leaves before root). Parentheses like `(2 + 3) * 4` change the tree's shape: `+` is pushed deeper than `*`. In Python: `import ast; print(ast.dump(ast.parse('x = 2 + 3 * 4')))` shows the real AST in one line.
The AST of real code is always inspectable: in Python `import ast; ast.dump(ast.parse(...))`, in Rust `cargo expand`, in Clang `clang -Xclang -ast-dump -fsyntax-only file.c`.
What does the root of the AST look like for the expression `(a + b) * c`?
From AST to Machine Code
The AST is the last human-readable representation. From here the compiler translates it into **machine code** - a sequence of instructions the CPU executes directly. GCC spends roughly 60% of compilation time on backend phases: register allocation, instruction selection, scheduling. LLVM extracted all of this into a reusable library - that is why Rust, Swift, and Julia get the same optimization quality for free.
| x86 instruction | Action | Python equivalent |
|---|---|---|
| mov eax, 42 | Load 42 into register eax | x = 42 |
| add eax, ebx | eax = eax + ebx | x = x + y |
| imul eax, ecx | eax = eax * ecx | x = x * y |
| cmp eax, 0 | Compare eax with 0 (set flags) | if x == 0 |
| jne label | Jump to label if not equal | if ... : goto |
| call func | Call a function | func() |
| ret | Return from a function | return |
Real code generation is far more involved: calling conventions (which arguments go in which registers), memory alignment, ABI compatibility, and exception handling all matter. GCC spends roughly 60% of its compilation time on backend phases.
AST and Parse Tree are the same thing - just different names
A Parse Tree (concrete syntax tree) contains ALL grammar rules, including parentheses, delimiters, and intermediate non-terminals. An AST (abstract syntax tree) contains only operations and values.
The Parse Tree for `(2 + 3)` includes nodes for the parentheses and intermediate grammar rules (expr -> term -> factor). The AST contains only `BinOp(+, 2, 3)`. Compilers build and work with ASTs because they are more compact.
Why is a linker needed if the compiler has already produced machine code?
Key Takeaways
- Source code is a stream of bytes. Encoding (UTF-8) defines how characters map to bytes
- The **lexer** groups characters into **tokens** - the smallest language units. Clang/GCC hand-write the lexer for speed; Tree-sitter makes it incremental for IDEs
- The **parser** builds an **AST** from tokens. Tree structure encodes operator precedence. Parentheses are not needed in the AST - they are already captured by tree shape
- **Codegen** translates the AST into machine instructions. The **linker** assembles object files by resolving cross-module references
- LLVM extracted the entire backend into a reusable library - 60+ languages get the same optimization quality
- Parse Tree retains all grammatical details; AST retains only operations and values. Compilers operate on ASTs
Related Topics
Each stage of the pipeline is covered in its own lesson:
- Compiler Architecture — How passes and the pipeline are organized, and how code flows through them
- Lexer Fundamentals — A complete lexer implementation with error handling, strings, and comments
- Context-Free Grammars — The formal rules by which a parser builds an AST
Вопросы для размышления
- Does the lexer split `>=` into one token (GTE) or two (`>` and `=`)? How does the lexer decide? Hint: the maximal munch rule.
- Could a compiler skip the AST and generate machine code directly from tokens? What limitations would that impose?
- Why do real compilers (GCC, Clang) implement lexers by hand instead of using regex? Hint: speed and context handling.
Связанные уроки
- comp-03-compiler-architecture — Passes and pipeline - next level after basic phases
- comp-06-lexer-basics — Complete lexer implementation with error handling
- comp-10-cfg — Formal grammars underpin every parser
- fl-02-chomsky — Chomsky hierarchy explains why lexer (regex) and parser (CFG) are separate
- arch-05-instruction-cycle — Machine code produced by codegen runs on the CPU instruction cycle
- comp-33-v8 — V8 is a production example of JIT built on this same pipeline
- fl-01-intro