Compilers

LR Parsing: SLR, LALR, CLR

PostgreSQL parses every SQL query through an LALR(1) parser generated by Bison. GCC historically used bison for C and C++. The Rust compiler started with LALR, then switched to hand-written recursive descent for better error messages. LR parsing is theory that runs in production every second on billions of devices.

**PostgreSQL** uses an LALR(1) Bison-generated parser for SQL: the PostgreSQL grammar has over 700 rules and produces a table with thousands of states
**GCC** historically used yacc/bison for parsing C, C++, and Fortran, before switching to hand-written recursive descent in GCC 3.4 to improve error diagnostics
**Ruby MRI** (CRuby) used a yacc grammar for parsing Ruby through version 1.8; the modern YJIT compiler still operates on this heritage

Shift-Reduce Parsing

GCC, the bison-based Rust compiler, the PostgreSQL parser - all use LR parsing. Why not LL? Because LR supports a far wider class of grammars, including left recursion, which breaks LL entirely. **Shift-reduce** is the basic operational model for LR: the parser manages a stack and an input stream, performing two operations. Shift: take the next token from input and push it onto the stack. Reduce: pop the top of the stack matching the right-hand side of some rule and replace it with the rule's left-hand side. Parsing completes when the stack contains just the start symbol and the input is exhausted.

LR stands for 'Left-to-right, Rightmost derivation in reverse': reads tokens left to right, builds the rightmost parse tree bottom-up. Unlike LL (top-down), LR defers the rule decision until it has seen enough right context. This is the key advantage: LR commits to a reduction only when the entire right-hand side is already on the stack.

What does the 'reduce' operation do in shift-reduce parsing?

LR Items and the State Automaton

How does the parser know when to shift and when to reduce? Through a finite automaton built from **LR items**. An LR(0) item is a grammar rule with a dot marking the current position in the right-hand side: `E -> E . '+' T` means 'E is already on the stack; waiting for +'. A set of such items forms a **state** of the automaton. Transition on symbol X: take all items with a dot before X, advance the dot right - that is the new state. This is the canonical LR(0) automaton.

Closure(I): if A -> alpha . B beta is in I, add all rules B -> . gamma to I (recursively). Goto(I, X): take all A -> alpha . X beta from I, apply closure to all A -> alpha X . beta. Real grammars may have thousands of states: the C++ LR(1) automaton has over 10 000 states.

What does the LR item `E -> E + . T` represent during parsing?

ACTION and GOTO Tables: SLR vs LALR vs CLR

The LR automaton is encoded in two tables. **ACTION[state, token]**: shift s (go to state s), reduce r (apply rule r), or accept/error. **GOTO[state, nonterminal]**: which state to enter after a reduction. The difference between **SLR, LALR, and CLR** lies in how the reduce entry is populated: SLR uses FOLLOW (coarse), LALR uses merged lookaheads (a compromise), CLR(1) uses precise LR(1) items with exact lookaheads (most precise, but expensive). Bison defaults to LALR(1).

SLR: reduce A -> alpha in state s on token a if a is in FOLLOW(A). Most conflicts. LALR(1): builds the LR(1) automaton, then merges states that share the same LR(0) cores. Fewer conflicts; table has the same number of rows as SLR. CLR(1) (canonical LR): full LR(1) automaton; table can be 10-100x larger than LALR(1). GCC switched from bison-LALR to hand-written recursive descent in GCC 3.4 specifically for better error messages.

How does LALR(1) differ from CLR(1) given the same lookahead k=1?

Conflicts: Shift/Reduce and Reduce/Reduce

A conflict in an LR table means a single ACTION cell contains two different actions. **Shift/Reduce conflict**: in state s on token a, the parser could either shift or reduce. The classic example is the 'dangling else': `if E then if E then S else S` - which `if` does the `else` belong to? Bison resolves this by default in favor of shift (else binds to the nearest if). **Reduce/Reduce conflict**: on a single token, two different rules could be reduced - a signal of grammar ambiguity that requires manual correction.

Conflict resolution in LALR generators: Shift/Reduce - operator precedence (%left, %right, %nonassoc in Bison) and associativity. If the token has higher precedence than the rule - shift; if lower - reduce; if equal - check associativity. Reduce/Reduce conflicts are almost always a grammar design error: the rules should be rewritten rather than relying on 'default resolution'.

A shift/reduce conflict in a parser is always an error that must be eliminated

Shift/reduce conflicts are often resolved deliberately through operator precedence; reduce/reduce conflicts are almost always a sign of grammar ambiguity

Most programming languages have ambiguous expression grammars: without %left/%right Bison reports conflicts, but with correct precedence declarations the parser is correct. Reduce/reduce conflicts are the dangerous ones - they signal a design problem that requires grammar rewriting.

Bison declares `%left '+'` followed by `%left '*'`. How does this affect parsing of `3 + 4 * 2`?

Key Ideas

**Shift-reduce** builds the AST bottom-up: shift pushes a token, reduce applies a rule - the decision is made only when the entire right-hand side is on the stack
**SLR < LALR < CLR**: growing precision of lookahead and expressive power, but also growing table size; LALR(1) is the sweet spot used in Bison/Yacc
**Shift/Reduce conflicts** are resolved via precedence (%left/%right); **Reduce/Reduce conflicts** signal an ambiguous grammar that needs redesign

Вопросы для размышления

GCC switched from an LALR parser to hand-written recursive descent to improve error messages. What properties of LR parsers make generating high-quality error messages a hard problem?
Why is a static LALR(1) approach inconvenient for languages with user-defined operators (Haskell, Scala) - and how do these languages solve expression parsing with arbitrary operators?
LALR(1) merges states of the LR(1) automaton. In what situation does this merging introduce new conflicts that did not exist in the full CLR(1) automaton?

Связанные уроки

comp-11-recursive-descent — Understanding top-down parsing helps contrast with LR
comp-10-cfg — LR handles a strict superset of LL grammars
comp-14-parser-generators — YACC/Bison generate LR parsers from grammar specs
fl-12-cfg — LR(k) grammar class is the formal basis
ds-03-stacks — LR parser state machine uses explicit stack for shift-reduce
plt-25-parser

Compilers

LR Parsing: SLR, LALR, CLR

**PostgreSQL** uses an LALR(1) Bison-generated parser for SQL: the PostgreSQL grammar has over 700 rules and produces a table with thousands of states
**GCC** historically used yacc/bison for parsing C, C++, and Fortran, before switching to hand-written recursive descent in GCC 3.4 to improve error diagnostics
**Ruby MRI** (CRuby) used a yacc grammar for parsing Ruby through version 1.8; the modern YJIT compiler still operates on this heritage

Shift-Reduce Parsing

What does the 'reduce' operation do in shift-reduce parsing?

LR Items and the State Automaton

What does the LR item `E -> E + . T` represent during parsing?

ACTION and GOTO Tables: SLR vs LALR vs CLR

How does LALR(1) differ from CLR(1) given the same lookahead k=1?

Conflicts: Shift/Reduce and Reduce/Reduce

A shift/reduce conflict in a parser is always an error that must be eliminated

Shift/reduce conflicts are often resolved deliberately through operator precedence; reduce/reduce conflicts are almost always a sign of grammar ambiguity

Bison declares `%left '+'` followed by `%left '*'`. How does this affect parsing of `3 + 4 * 2`?

Key Ideas

**Shift-reduce** builds the AST bottom-up: shift pushes a token, reduce applies a rule - the decision is made only when the entire right-hand side is on the stack
**SLR < LALR < CLR**: growing precision of lookahead and expressive power, but also growing table size; LALR(1) is the sweet spot used in Bison/Yacc
**Shift/Reduce conflicts** are resolved via precedence (%left/%right); **Reduce/Reduce conflicts** signal an ambiguous grammar that needs redesign

Вопросы для размышления

GCC switched from an LALR parser to hand-written recursive descent to improve error messages. What properties of LR parsers make generating high-quality error messages a hard problem?
Why is a static LALR(1) approach inconvenient for languages with user-defined operators (Haskell, Scala) - and how do these languages solve expression parsing with arbitrary operators?
LALR(1) merges states of the LR(1) automaton. In what situation does this merging introduce new conflicts that did not exist in the full CLR(1) automaton?

Связанные уроки

comp-11-recursive-descent — Understanding top-down parsing helps contrast with LR
comp-10-cfg — LR handles a strict superset of LL grammars
comp-14-parser-generators — YACC/Bison generate LR parsers from grammar specs
fl-12-cfg — LR(k) grammar class is the formal basis
ds-03-stacks — LR parser state machine uses explicit stack for shift-reduce
plt-25-parser

LR Parsing: SLR, LALR, CLR

Shift-Reduce Parsing

LR Items and the State Automaton

ACTION and GOTO Tables: SLR vs LALR vs CLR

Conflicts: Shift/Reduce and Reduce/Reduce

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

LR Parsing: SLR, LALR, CLR

Shift-Reduce Parsing

LR Items and the State Automaton

ACTION and GOTO Tables: SLR vs LALR vs CLR

Conflicts: Shift/Reduce and Reduce/Reduce

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки