Compilers
Recursive Descent Parsing
The Go compiler's parser is hand-written using recursive descent - Rob Pike deliberately rejected parser generators for the sake of readability and error message quality. CPython, V8, Clang - all of them use recursive descent. This is not nostalgia; it is an engineering choice: when the grammar fits in one's head, a handwritten parser remains the most understandable code in the compiler.
- **Go compiler** is written entirely by hand using recursive descent - Rob Pike believed a parser generator would harm error message quality and complicate long-term maintenance
- **TypeScript** uses a predictive parser with manual lookahead: `parseExpression()` calls `parseBinaryExpression()`, which recursively calls itself with operator-precedence awareness
- **ANTLR** generates LL(*) parsers - a generalization of LL(1) that determines lookahead adaptively at runtime rather than from a fixed k
Top-Down Parsing
The Go compiler is hand-written using recursive descent - and not by accident. **Top-down parsing** builds the parse tree from the root down: it starts with the start nonterminal and attempts to expand it into terminals by reading tokens left to right. Each grammar nonterminal maps directly to one parser function - an elegant correspondence between the grammar and the code. When the Go compiler parses `func foo(x int) {}`, the function `parseFunction()` calls `parseParams()`, which calls `parseType()` - the call tree mirrors the syntax tree.
Backtracking top-down parsing tries all alternatives of a rule and backtracks on failure: exponential worst-case complexity. Predictive (LL) parsing eliminates backtracking by constraining the grammar: a single lookahead token always uniquely identifies the applicable rule. Most practical compilers use predictive parsing.
Why does recursive descent with backtracking have exponential worst-case complexity?
Predictive Parsing
Predictive parsing is recursive descent without blind turns. The key insight: by looking at the next token, the parser always knows exactly which rule to apply - no backtracking needed. This is only possible when the grammar has no conflicts: two different rules for the same nonterminal must not begin with the same token. Python and JavaScript use predictive parsers: that is exactly why syntax errors in these languages point to the precise location of the problem rather than just 'something went wrong'.
A predictive parser can be implemented two ways: recursive (a function per nonterminal, lookahead via if/switch) or tabular LL(k) (a transition table driven by a stack). The recursive form is easier to write and debug; the tabular form is easier to generate automatically. ANTLR generates predictive parsers with arbitrary lookahead (LL(*)).
What does a predictive parser do when it sees the next token?
FIRST and FOLLOW Sets
For a predictive parser to work, we need a formal description of which tokens can start each rule. Two sets provide this. **FIRST(A)** is the set of terminals that can begin any string derivable from nonterminal A. **FOLLOW(A)** is the set of terminals that can follow A in any derivation. FOLLOW is needed for epsilon rules: if A -> epsilon, then A is applied when the current token is in FOLLOW(A).
FIRST algorithm: if A -> a..., then a is in FIRST(A); if A -> B..., then FIRST(B) is in FIRST(A); if A -> epsilon, then epsilon is in FIRST(A). FOLLOW(A): $ is in FOLLOW(S) (start symbol); if B -> ...AC..., then FIRST(C) \ {epsilon} is in FOLLOW(A); if epsilon is in FIRST(C), then FOLLOW(B) is in FOLLOW(A).
Why is the FOLLOW set needed when building a predictive parser?
LL(1) Grammars and Parse Tables
A grammar is **LL(1)** if a predictive parser with a single lookahead token always makes a unique decision. This is a strict condition: no two rules for the same nonterminal may compete for the same token. Left recursion (A -> A b) violates LL(1) unconditionally - the parser enters an infinite loop. Left-recursive rules must be transformed into right recursion or iteration. Most industrial language grammars (Java, Go, TypeScript) are not strictly LL(1), so their parsers use extensions: LL(k), LL(*), or manual multi-token lookahead.
LL(1) parse table: rows are nonterminals, columns are terminals + $. Cell M[A, a] contains rule A -> alpha if: a is in FIRST(alpha), or (epsilon is in FIRST(alpha) and a is in FOLLOW(A)). A conflict (multiple rules in one cell) means the grammar is not LL(1). Left factoring and left recursion elimination are standard transformations to bring a grammar toward LL(1).
Any grammar can be made LL(1) by applying left factoring and left recursion elimination
Some grammars are fundamentally non-LL(1): for example, ambiguous grammars or those requiring more than one lookahead token to resolve a choice
The class of LL(1) grammars is strictly smaller than the class of context-free grammars. When an LL(1) grammar is not achievable, alternatives include LR parsing, PEG, or manual multi-token lookahead.
A grammar contains the rule A -> A '+' B | B (left recursion). Why does this violate LL(1)?
Key Ideas
- **Top-down = one function per nonterminal**: elegant grammar-to-code correspondence, readable parser, excellent error messages
- **FIRST/FOLLOW** define the LL(1) parse table: FIRST - what can start a rule, FOLLOW - what follows a nonterminal (needed for epsilon rules)
- **Left recursion breaks LL(1)**: eliminated by transforming A -> A a | b into A -> b A', A' -> a A' | eps; not all grammars are reducible to LL(1)
Related Topics
Recursive descent is the foundation of top-down parsing, connected to adjacent topics:
- Context-Free Grammars — LL(1) is a subclass of CFGs; understanding CFGs is required to build FIRST/FOLLOW sets and parse tables
- LR Parsing: SLR, LALR, CLR — LR parsing is the bottom-up alternative to LL; it supports a wider class of grammars without the LL(1) restrictions
Вопросы для размышления
- A language grammar contains 'if E then S else S | if E then S'. Why is this a problem for LL(1), and how do practical compilers like those for C or Java resolve it?
- How is a hand-written recursive descent parser better than a tabular LL(1) parser for error messages - and how is it worse from an automation standpoint?
- If a language requires an LR(1) grammar, should the parser be written by hand or generated? What are the tradeoffs of each approach?
Связанные уроки
- comp-10-cfg — Recursive descent implements CFG grammar rules
- comp-12-lr-parsing — Compare top-down with bottom-up parsing
- comp-14-parser-generators — Generators automate what recursive descent does by hand
- fl-12-cfg — LL(k) grammar class is what recursive descent handles
- plt-25-parser