Compilers
LR Parsing: SLR, LALR, CLR
PostgreSQL parses every SQL query through an LALR(1) parser generated by Bison. GCC historically used bison for C and C++. The Rust compiler started with LALR, then switched to hand-written recursive descent for better error messages. LR parsing is theory that runs in production every second on billions of devices.
- **PostgreSQL** uses an LALR(1) Bison-generated parser for SQL: the PostgreSQL grammar has over 700 rules and produces a table with thousands of states
- **GCC** historically used yacc/bison for parsing C, C++, and Fortran, before switching to hand-written recursive descent in GCC 3.4 to improve error diagnostics
- **Ruby MRI** (CRuby) used a yacc grammar for parsing Ruby through version 1.8; the modern YJIT compiler still operates on this heritage
Shift-Reduce Parsing
GCC, the bison-based Rust compiler, the PostgreSQL parser - all use LR parsing. Why not LL? Because LR supports a far wider class of grammars, including left recursion, which breaks LL entirely. **Shift-reduce** is the basic operational model for LR: the parser manages a stack and an input stream, performing two operations. Shift: take the next token from input and push it onto the stack. Reduce: pop the top of the stack matching the right-hand side of some rule and replace it with the rule's left-hand side. Parsing completes when the stack contains just the start symbol and the input is exhausted.
LR stands for 'Left-to-right, Rightmost derivation in reverse': reads tokens left to right, builds the rightmost parse tree bottom-up. Unlike LL (top-down), LR defers the rule decision until it has seen enough right context. This is the key advantage: LR commits to a reduction only when the entire right-hand side is already on the stack.
What does the 'reduce' operation do in shift-reduce parsing?
LR Items and the State Automaton
How does the parser know when to shift and when to reduce? Through a finite automaton built from **LR items**. An LR(0) item is a grammar rule with a dot marking the current position in the right-hand side: `E -> E . '+' T` means 'E is already on the stack; waiting for +'. A set of such items forms a **state** of the automaton. Transition on symbol X: take all items with a dot before X, advance the dot right - that is the new state. This is the canonical LR(0) automaton.
Closure(I): if A -> alpha . B beta is in I, add all rules B -> . gamma to I (recursively). Goto(I, X): take all A -> alpha . X beta from I, apply closure to all A -> alpha X . beta. Real grammars may have thousands of states: the C++ LR(1) automaton has over 10 000 states.
What does the LR item `E -> E + . T` represent during parsing?
ACTION and GOTO Tables: SLR vs LALR vs CLR
The LR automaton is encoded in two tables. **ACTION[state, token]**: shift s (go to state s), reduce r (apply rule r), or accept/error. **GOTO[state, nonterminal]**: which state to enter after a reduction. The difference between **SLR, LALR, and CLR** lies in how the reduce entry is populated: SLR uses FOLLOW (coarse), LALR uses merged lookaheads (a compromise), CLR(1) uses precise LR(1) items with exact lookaheads (most precise, but expensive). Bison defaults to LALR(1).
SLR: reduce A -> alpha in state s on token a if a is in FOLLOW(A). Most conflicts. LALR(1): builds the LR(1) automaton, then merges states that share the same LR(0) cores. Fewer conflicts; table has the same number of rows as SLR. CLR(1) (canonical LR): full LR(1) automaton; table can be 10-100x larger than LALR(1). GCC switched from bison-LALR to hand-written recursive descent in GCC 3.4 specifically for better error messages.
How does LALR(1) differ from CLR(1) given the same lookahead k=1?
Conflicts: Shift/Reduce and Reduce/Reduce
A conflict in an LR table means a single ACTION cell contains two different actions. **Shift/Reduce conflict**: in state s on token a, the parser could either shift or reduce. The classic example is the 'dangling else': `if E then if E then S else S` - which `if` does the `else` belong to? Bison resolves this by default in favor of shift (else binds to the nearest if). **Reduce/Reduce conflict**: on a single token, two different rules could be reduced - a signal of grammar ambiguity that requires manual correction.
Conflict resolution in LALR generators: Shift/Reduce - operator precedence (%left, %right, %nonassoc in Bison) and associativity. If the token has higher precedence than the rule - shift; if lower - reduce; if equal - check associativity. Reduce/Reduce conflicts are almost always a grammar design error: the rules should be rewritten rather than relying on 'default resolution'.
A shift/reduce conflict in a parser is always an error that must be eliminated
Shift/reduce conflicts are often resolved deliberately through operator precedence; reduce/reduce conflicts are almost always a sign of grammar ambiguity
Most programming languages have ambiguous expression grammars: without %left/%right Bison reports conflicts, but with correct precedence declarations the parser is correct. Reduce/reduce conflicts are the dangerous ones - they signal a design problem that requires grammar rewriting.
Bison declares `%left '+'` followed by `%left '*'`. How does this affect parsing of `3 + 4 * 2`?
Key Ideas
- **Shift-reduce** builds the AST bottom-up: shift pushes a token, reduce applies a rule - the decision is made only when the entire right-hand side is on the stack
- **SLR < LALR < CLR**: growing precision of lookahead and expressive power, but also growing table size; LALR(1) is the sweet spot used in Bison/Yacc
- **Shift/Reduce conflicts** are resolved via precedence (%left/%right); **Reduce/Reduce conflicts** signal an ambiguous grammar that needs redesign
Related Topics
LR parsing builds on CFGs and is the bottom-up alternative to the LL approach:
- Recursive Descent (LL) — LL and LR are opposite parsing strategies; LR supports a wider grammar class but is harder to read and write by hand
- Context-Free Grammars — LR parsing is built on CFGs; understanding FIRST/FOLLOW is necessary to analyze conflicts in SLR
Вопросы для размышления
- GCC switched from an LALR parser to hand-written recursive descent to improve error messages. What properties of LR parsers make generating high-quality error messages a hard problem?
- Why is a static LALR(1) approach inconvenient for languages with user-defined operators (Haskell, Scala) - and how do these languages solve expression parsing with arbitrary operators?
- LALR(1) merges states of the LR(1) automaton. In what situation does this merging introduce new conflicts that did not exist in the full CLR(1) automaton?
Связанные уроки
- comp-11-recursive-descent — Understanding top-down parsing helps contrast with LR
- comp-10-cfg — LR handles a strict superset of LL grammars
- comp-14-parser-generators — YACC/Bison generate LR parsers from grammar specs
- fl-12-cfg — LR(k) grammar class is the formal basis
- ds-03-stacks — LR parser state machine uses explicit stack for shift-reduce
- plt-25-parser