Compilers

How a Computer Reads Code

John Backus and the First Real Compiler

In 1957, John Backus's team at IBM shipped the first FORTRAN compiler. Before that, every program was hand-written in assembly. Colleagues skeptically called the project 'word games' - they did not believe a machine could generate code as well as a human. The FORTRAN compiler proved them wrong: the generated code was nearly as efficient as handwritten assembly. Backus also invented BNF notation (Backus-Naur Form) - the standard way of writing programming language grammars.

Цели урока

  • Understand that source code is a stream of bytes and how encoding determines character mapping
  • Explain the lexer's role: grouping characters into tokens while tracking position
  • Distinguish Parse Tree from AST, build an AST for arithmetic expressions
  • Understand how codegen translates AST into machine instructions and the role of the linker

Every Python line, every React component, every Rust backend request - all pass through the same pipeline. LLVM compiles 60+ languages. V8 JIT-compiles JavaScript into native code 100x faster than interpretation. Behind all that variety sits one architecture: characters -> tokens -> AST -> IR -> machine code.

  • **V8 (Google Chrome, Node.js)** - JIT-compiles JS through AST -> Ignition bytecode -> TurboFan machine code. The same pipeline as GCC
  • **LLVM** - backend that compiles Rust, Swift, C++, Julia, Kotlin/Native. 60+ frontend languages, one IR
  • **TypeScript compiler (tsc)** - complete pipeline: lexer -> parser -> type checker -> emitter. 100K+ lines of TypeScript
  • **Python bytecode** - CPython compiles to .pyc (bytecode) executed by the VM. `dis.dis(func)` shows the result
  • **ESLint/Prettier/Babel** - all operate on AST transformations. An ESLint plugin is a visitor over AST nodes

Предварительные знания

  • What Is a Compiler

Source Code as a Stream of Characters

**LLVM compiles Rust, C++, Swift, Julia, and 60+ other languages.** Different syntaxes, different semantics - one pipeline. Because at the input, all these languages look identical: a stream of bytes. `x = 42 + y` for the compiler is `78 20 3D 20 34 32 20 2B 20 79`. No variables, no operators - just numbers. The job of the first phase is to reconstruct **structure and meaning** from that raw stream.

**Encoding** determines how characters map to bytes. ASCII uses 1 byte (128 characters). UTF-8 uses 1-4 bytes (the full Unicode range). Python 3, Rust, Go, and Swift all use UTF-8. GCC supports files in other encodings via `-finput-charset`.

EncodingCharacter sizeRangeExample
ASCII1 byte128 charactersA-Z, a-z, 0-9, +, =, ;
Latin-11 byte256 charactersASCII + accented characters
UTF-81-4 bytes1,112,064 charactersEverything: CJK, emoji, Cyrillic
UTF-162-4 bytes1,112,064 charactersUsed internally by Java and JS

The compiler reads a file **sequentially**, one character at a time, left to right, top to bottom. It tracks the current **position** (line:column) for error messages. When seeing `error at line 42, column 15` - that is the position tracker inside the lexer doing its job.

The BOM (Byte Order Mark) is the invisible character `U+FEFF` that some editors (Notepad on Windows) silently prepend to UTF-8 files. Many mysterious 'the code looks correct but won't compile' bugs are caused by invisible Unicode characters.

Why does the compiler need to know the position (line:column) of each character?

Tokens: the Atoms of a Language

**Lexical analysis (lexing)** is the compiler's first phase. The lexer groups characters into **tokens** - the smallest meaningful units of the language. Clang and GCC hand-write the lexer using switch/case: regex gives the right result, but production compilers need throughput to process millions of lines per second. Tree-sitter, used by VS Code for code navigation, is an incremental lexer - when a single line changes, only the modified fragment gets re-tokenized.

CategoryExamplesRegex pattern
Keywordif, else, while, returnFixed list
Identifierx, myVar, calculate[a-zA-Z_][a-zA-Z0-9_]*
Number42, 3.14, 0xFF[0-9]+(\\.[0-9]+)?
String"hello", 'world'"[^"]*"
Operator+, -, *, ==, !=Fixed list
Delimiter(, ), {, }, ;Single characters
Whitespacespace, tab, newlineUsually discarded

**Rule order matters!** The keyword `if` must be checked *before* the general identifier pattern `[a-zA-Z_]\w*`, otherwise `if` will be recognized as an ordinary identifier. Use `\b` (word boundary) for keywords so that `iffy` is not tokenized as `if` + `fy`.

How many tokens will the lexer produce for the expression `if (x >= 10)` ?

AST: the Expression Tree

A list of tokens is flat. It does not capture **structure**: the fact that `*` executes before `+`, or that `if` controls a block. The **parser** builds an **AST** (Abstract Syntax Tree) from tokens - a tree that reflects the hierarchy of expressions. ESLint, Prettier, Babel, and the TypeScript compiler all operate on AST transformations. Rust proc-macros receive an AST as input and return an AST as output.

**Operator precedence** is encoded in the tree's structure: `*` sits deeper than `+`, so it is evaluated first (leaves before root). Parentheses like `(2 + 3) * 4` change the tree's shape: `+` is pushed deeper than `*`. In Python: `import ast; print(ast.dump(ast.parse('x = 2 + 3 * 4')))` shows the real AST in one line.

The AST of real code is always inspectable: in Python `import ast; ast.dump(ast.parse(...))`, in Rust `cargo expand`, in Clang `clang -Xclang -ast-dump -fsyntax-only file.c`.

What does the root of the AST look like for the expression `(a + b) * c`?

From AST to Machine Code

The AST is the last human-readable representation. From here the compiler translates it into **machine code** - a sequence of instructions the CPU executes directly. GCC spends roughly 60% of compilation time on backend phases: register allocation, instruction selection, scheduling. LLVM extracted all of this into a reusable library - that is why Rust, Swift, and Julia get the same optimization quality for free.

x86 instructionActionPython equivalent
mov eax, 42Load 42 into register eaxx = 42
add eax, ebxeax = eax + ebxx = x + y
imul eax, ecxeax = eax * ecxx = x * y
cmp eax, 0Compare eax with 0 (set flags)if x == 0
jne labelJump to label if not equalif ... : goto
call funcCall a functionfunc()
retReturn from a functionreturn

Real code generation is far more involved: calling conventions (which arguments go in which registers), memory alignment, ABI compatibility, and exception handling all matter. GCC spends roughly 60% of its compilation time on backend phases.

AST and Parse Tree are the same thing - just different names

A Parse Tree (concrete syntax tree) contains ALL grammar rules, including parentheses, delimiters, and intermediate non-terminals. An AST (abstract syntax tree) contains only operations and values.

The Parse Tree for `(2 + 3)` includes nodes for the parentheses and intermediate grammar rules (expr -> term -> factor). The AST contains only `BinOp(+, 2, 3)`. Compilers build and work with ASTs because they are more compact.

Why is a linker needed if the compiler has already produced machine code?

Key Takeaways

  • Source code is a stream of bytes. Encoding (UTF-8) defines how characters map to bytes
  • The **lexer** groups characters into **tokens** - the smallest language units. Clang/GCC hand-write the lexer for speed; Tree-sitter makes it incremental for IDEs
  • The **parser** builds an **AST** from tokens. Tree structure encodes operator precedence. Parentheses are not needed in the AST - they are already captured by tree shape
  • **Codegen** translates the AST into machine instructions. The **linker** assembles object files by resolving cross-module references
  • LLVM extracted the entire backend into a reusable library - 60+ languages get the same optimization quality
  • Parse Tree retains all grammatical details; AST retains only operations and values. Compilers operate on ASTs

Related Topics

Each stage of the pipeline is covered in its own lesson:

  • Compiler Architecture — How passes and the pipeline are organized, and how code flows through them
  • Lexer Fundamentals — A complete lexer implementation with error handling, strings, and comments
  • Context-Free Grammars — The formal rules by which a parser builds an AST

Вопросы для размышления

  • Does the lexer split `>=` into one token (GTE) or two (`>` and `=`)? How does the lexer decide? Hint: the maximal munch rule.
  • Could a compiler skip the AST and generate machine code directly from tokens? What limitations would that impose?
  • Why do real compilers (GCC, Clang) implement lexers by hand instead of using regex? Hint: speed and context handling.

Связанные уроки

  • comp-03-compiler-architecture — Passes and pipeline - next level after basic phases
  • comp-06-lexer-basics — Complete lexer implementation with error handling
  • comp-10-cfg — Formal grammars underpin every parser
  • fl-02-chomsky — Chomsky hierarchy explains why lexer (regex) and parser (CFG) are separate
  • arch-05-instruction-cycle — Machine code produced by codegen runs on the CPU instruction cycle
  • comp-33-v8 — V8 is a production example of JIT built on this same pipeline
  • fl-01-intro
How a Computer Reads Code

0

1

Sign In