Compilers

How a Computer Reads Code

John Backus and the First Real Compiler

In 1957, John Backus's team at IBM shipped the first FORTRAN compiler. Before that, every program was hand-written in assembly. Colleagues skeptically called the project 'word games' - they did not believe a machine could generate code as well as a human. The FORTRAN compiler proved them wrong: the generated code was nearly as efficient as handwritten assembly. Backus also invented BNF notation (Backus-Naur Form) - the standard way of writing programming language grammars.

Цели урока

Understand that source code is a stream of bytes and how encoding determines character mapping
Explain the lexer's role: grouping characters into tokens while tracking position
Distinguish Parse Tree from AST, build an AST for arithmetic expressions
Understand how codegen translates AST into machine instructions and the role of the linker

Every Python line, every React component, every Rust backend request - all pass through the same pipeline. LLVM compiles 60+ languages. V8 JIT-compiles JavaScript into native code 100x faster than interpretation. Behind all that variety sits one architecture: characters -> tokens -> AST -> IR -> machine code.

**V8 (Google Chrome, Node.js)** - JIT-compiles JS through AST -> Ignition bytecode -> TurboFan machine code. The same pipeline as GCC
**LLVM** - backend that compiles Rust, Swift, C++, Julia, Kotlin/Native. 60+ frontend languages, one IR
**TypeScript compiler (tsc)** - complete pipeline: lexer -> parser -> type checker -> emitter. 100K+ lines of TypeScript
**Python bytecode** - CPython compiles to .pyc (bytecode) executed by the VM. `dis.dis(func)` shows the result
**ESLint/Prettier/Babel** - all operate on AST transformations. An ESLint plugin is a visitor over AST nodes

Предварительные знания

What Is a Compiler

Source Code as a Stream of Characters

**LLVM compiles Rust, C++, Swift, Julia, and 60+ other languages.** Different syntaxes, different semantics - one pipeline. Because at the input, all these languages look identical: a stream of bytes. `x = 42 + y` for the compiler is `78 20 3D 20 34 32 20 2B 20 79`. No variables, no operators - just numbers. The job of the first phase is to reconstruct **structure and meaning** from that raw stream.

**Encoding** determines how characters map to bytes. ASCII uses 1 byte (128 characters). UTF-8 uses 1-4 bytes (the full Unicode range). Python 3, Rust, Go, and Swift all use UTF-8. GCC supports files in other encodings via `-finput-charset`.

Encoding	Character size	Range	Example
ASCII	1 byte	128 characters	A-Z, a-z, 0-9, +, =, ;
Latin-1	1 byte	256 characters	ASCII + accented characters
UTF-8	1-4 bytes	1,112,064 characters	Everything: CJK, emoji, Cyrillic
UTF-16	2-4 bytes	1,112,064 characters	Used internally by Java and JS

The compiler reads a file **sequentially**, one character at a time, left to right, top to bottom. It tracks the current **position** (line:column) for error messages. When seeing `error at line 42, column 15` - that is the position tracker inside the lexer doing its job.

The BOM (Byte Order Mark) is the invisible character `U+FEFF` that some editors (Notepad on Windows) silently prepend to UTF-8 files. Many mysterious 'the code looks correct but won't compile' bugs are caused by invisible Unicode characters.

Why does the compiler need to know the position (line:column) of each character?

Tokens: the Atoms of a Language

**Lexical analysis (lexing)** is the compiler's first phase. The lexer groups characters into **tokens** - the smallest meaningful units of the language. Clang and GCC hand-write the lexer using switch/case: regex gives the right result, but production compilers need throughput to process millions of lines per second. Tree-sitter, used by VS Code for code navigation, is an incremental lexer - when a single line changes, only the modified fragment gets re-tokenized.

Category	Examples	Regex pattern
Keyword	if, else, while, return	Fixed list
Identifier	x, myVar, calculate	[a-zA-Z_][a-zA-Z0-9_]*
Number	42, 3.14, 0xFF	[0-9]+(\\.[0-9]+)?
String	"hello", 'world'	"[^"]*"
Operator	+, -, *, ==, !=	Fixed list
Delimiter	(, ), {, }, ;	Single characters
Whitespace	space, tab, newline	Usually discarded

**Rule order matters!** The keyword `if` must be checked *before* the general identifier pattern `[a-zA-Z_]\w*`, otherwise `if` will be recognized as an ordinary identifier. Use `\b` (word boundary) for keywords so that `iffy` is not tokenized as `if` + `fy`.

How many tokens will the lexer produce for the expression `if (x >= 10)` ?

AST: the Expression Tree

A list of tokens is flat. It does not capture **structure**: the fact that `*` executes before `+`, or that `if` controls a block. The **parser** builds an **AST** (Abstract Syntax Tree) from tokens - a tree that reflects the hierarchy of expressions. ESLint, Prettier, Babel, and the TypeScript compiler all operate on AST transformations. Rust proc-macros receive an AST as input and return an AST as output.

**Operator precedence** is encoded in the tree's structure: `*` sits deeper than `+`, so it is evaluated first (leaves before root). Parentheses like `(2 + 3) * 4` change the tree's shape: `+` is pushed deeper than `*`. In Python: `import ast; print(ast.dump(ast.parse('x = 2 + 3 * 4')))` shows the real AST in one line.

The AST of real code is always inspectable: in Python `import ast; ast.dump(ast.parse(...))`, in Rust `cargo expand`, in Clang `clang -Xclang -ast-dump -fsyntax-only file.c`.

What does the root of the AST look like for the expression `(a + b) * c`?

From AST to Machine Code

The AST is the last human-readable representation. From here the compiler translates it into **machine code** - a sequence of instructions the CPU executes directly. GCC spends roughly 60% of compilation time on backend phases: register allocation, instruction selection, scheduling. LLVM extracted all of this into a reusable library - that is why Rust, Swift, and Julia get the same optimization quality for free.

x86 instruction	Action	Python equivalent
mov eax, 42	Load 42 into register eax	x = 42
add eax, ebx	eax = eax + ebx	x = x + y
imul eax, ecx	eax = eax * ecx	x = x * y
cmp eax, 0	Compare eax with 0 (set flags)	if x == 0
jne label	Jump to label if not equal	if ... : goto
call func	Call a function	func()
ret	Return from a function	return

Real code generation is far more involved: calling conventions (which arguments go in which registers), memory alignment, ABI compatibility, and exception handling all matter. GCC spends roughly 60% of its compilation time on backend phases.

AST and Parse Tree are the same thing - just different names

A Parse Tree (concrete syntax tree) contains ALL grammar rules, including parentheses, delimiters, and intermediate non-terminals. An AST (abstract syntax tree) contains only operations and values.

The Parse Tree for `(2 + 3)` includes nodes for the parentheses and intermediate grammar rules (expr -> term -> factor). The AST contains only `BinOp(+, 2, 3)`. Compilers build and work with ASTs because they are more compact.

Why is a linker needed if the compiler has already produced machine code?

Key Takeaways

Source code is a stream of bytes. Encoding (UTF-8) defines how characters map to bytes
The **lexer** groups characters into **tokens** - the smallest language units. Clang/GCC hand-write the lexer for speed; Tree-sitter makes it incremental for IDEs
The **parser** builds an **AST** from tokens. Tree structure encodes operator precedence. Parentheses are not needed in the AST - they are already captured by tree shape
**Codegen** translates the AST into machine instructions. The **linker** assembles object files by resolving cross-module references
LLVM extracted the entire backend into a reusable library - 60+ languages get the same optimization quality
Parse Tree retains all grammatical details; AST retains only operations and values. Compilers operate on ASTs

Вопросы для размышления

Does the lexer split `>=` into one token (GTE) or two (`>` and `=`)? How does the lexer decide? Hint: the maximal munch rule.
Could a compiler skip the AST and generate machine code directly from tokens? What limitations would that impose?
Why do real compilers (GCC, Clang) implement lexers by hand instead of using regex? Hint: speed and context handling.

Связанные уроки

comp-03-compiler-architecture — Passes and pipeline - next level after basic phases
comp-06-lexer-basics — Complete lexer implementation with error handling
comp-10-cfg — Formal grammars underpin every parser
fl-02-chomsky — Chomsky hierarchy explains why lexer (regex) and parser (CFG) are separate
arch-05-instruction-cycle — Machine code produced by codegen runs on the CPU instruction cycle
comp-33-v8 — V8 is a production example of JIT built on this same pipeline
fl-01-intro