Formal Languages

LL and LR Parsing: From Grammar to Parser Table

GCC's parser was LALR(1) (Yacc-generated) until version 3.x. The C++ front-end kept adding hacks to thread symbol table information back into the parser to resolve type-dependent ambiguities. By GCC 4.0, the team threw out the generated parser and rewrote it as hand-coded recursive descent. The lesson: LALR theory meets language reality, and sometimes reality wins.

**Bison + Flex** generates the parsers for PostgreSQL's SQL dialect, the Linux kernel's Kconfig language, and GNU bc. The PostgreSQL grammar has 800+ production rules and 120 known shift-reduce conflicts, all explicitly documented and resolved with %prec directives.
**ANTLR 4** powers the parsers for Kotlin (JetBrains), Twitter's Finagle DSL, and Hive SQL. Its ALL(*) algorithm handles left-recursive grammars natively and generates parse trees with listener/visitor patterns for semantic analysis.
**Python switched from LL(1) to PEG** in Python 3.9 (PEP 617). The old pgen LL(1) parser had accumulated 15 years of grammar workarounds for constructs that are not LL(1). PEG parsers handle arbitrary lookahead via memoization, at the cost of O(n*r) space (n input tokens, r grammar rules).

LL(k) Parsing: FIRST, FOLLOW, and Recursive Descent

Every programming language parser in production started with a grammar. The question is always the same: given an input string, which production rule applies at this point in the parse? LL(k) parsing answers this question by looking Left-to-right through the input, producing a Leftmost derivation, using k lookahead tokens to make the decision. LL(1) - one token of lookahead - is the sweet spot: capable enough for most programming languages, simple enough to implement by hand.

FIRST(A) is the set of terminals that can begin a string derived from nonterminal A. FOLLOW(A) is the set of terminals that can appear immediately after A in some sentential form. Together they define the parse table. A grammar is LL(1) if and only if for every nonterminal A with productions A -> alpha | beta, FIRST(alpha) and FIRST(beta) are disjoint, and if epsilon is in FIRST(alpha), then FIRST(beta) and FOLLOW(A) are disjoint.

Left recursion kills LL parsers. The grammar E -> E + T immediately loops: to parse E, call E, which calls E, infinitely. Left-recursive grammars must be transformed before LL parsing. The canonical transformation: E -> E + T becomes E -> T E', E' -> + T E' | epsilon. This elimination is mechanical and always possible for non-ambiguous grammars. ANTLR 4 handles left recursion automatically by detecting and rewriting it at grammar build time.

LL(1) LL(k) hierarchy: most programming language constructs are LL(1). The exceptions are common: dangling else (needs disambiguation rule), C++ template syntax (requires backtracking or GLR), Python's augmented assignments. ANTLR 4 implements ALL(*) - Adaptive LL(*) - which uses a runtime DFA constructed lazily, handling arbitrary lookahead without explicit k. This makes ANTLR practical for languages that are not formally LL(k) for any fixed k.

Grammar G: A -> aB | aC. Is G LL(1)? What information determines the answer?

LR Parsing: Items, States, and Shift-Reduce

LR parsing reverses the derivation. Instead of predicting which production to apply (top-down), an LR parser reads tokens, shifts them onto a stack, and reduces when the top of the stack matches the right-hand side of a production (bottom-up). LR(k): Left-to-right scan, Rightmost derivation in reverse, k lookahead tokens. LR(1) recognizes every deterministic context-free language - strictly stronger than LL(1). The grammar E -> E + T (left-recursive) that breaks LL parsers is directly handled by LR.

The LR automaton is built from item sets. An LR(0) item is a production with a dot marking the current parser position: [E -> E . + T]. The dot represents 'this much has been parsed.' The item set (state) is the closure of all items reachable from the current position. A shift action moves the dot past a terminal by consuming the next input token. A reduce action fires when the dot is at the end of a production: [E -> E + T .] - reduce using rule E -> E + T.

Shift-reduce conflicts: the parser has a stack configuration where it can either shift the next token or reduce the top of the stack. The canonical case: the dangling else. Grammar: Stmt -> if Expr then Stmt | if Expr then Stmt else Stmt. After parsing 'if E then if E then S', the next token is 'else' - shift (associate else with inner if) or reduce (associate with outer if)? Both are valid by the grammar, making it ambiguous. Most languages resolve this by convention (else associates with the nearest if) implemented as a shift-prefer-over-reduce rule.

Reduce-reduce conflicts are more serious than shift-reduce. Two different productions can both reduce the current stack top. This usually indicates a genuine grammar ambiguity that cannot be resolved by convention. Yacc/Bison reports reduce-reduce conflicts as warnings (defaulting to the first production) but the result is often wrong. The correct fix is to restructure the grammar.

An LR parser is in state s where items include [A -> alpha . beta] and [B -> gamma .]. The next input token is in FOLLOW(B). What actions are possible?

LALR(1) vs SLR vs Canonical LR: The Practical Hierarchy

SLR(1) (Simple LR) uses FOLLOW sets to decide when to reduce. A state with item [A -> alpha .] reduces when the next token is in FOLLOW(A). The weakness: FOLLOW is computed globally - it includes every context where A appears, not just the current parse context. This causes false conflicts. SLR(1) rejects grammars that are perfectly deterministic because its lookahead sets are too coarse.

Canonical LR(1) embeds lookahead directly into items: [A -> alpha ., a] means 'reduce A -> alpha when the next token is a'. This eliminates false conflicts but generates an enormous number of states. A grammar with 50 productions generates hundreds of item sets in LR(1), each with its own lookahead context. The parse table becomes impractical to store for hand-written grammars.

LALR(1) merges LR(1) states that have identical core items (items without lookahead). Merged states share lookahead - this is the source of LALR's one weakness: merging can introduce reduce-reduce conflicts that did not exist in canonical LR(1). In practice, all common programming languages (C, C++, Java, Python, Ruby) have LALR(1) grammars or nearly so. C++ is the famous exception: its type-dependent parsing requires GLR or a symbol table feedback loop.

Yacc and Bison generate LALR(1) parsers. Input is a grammar + semantic actions. Output is a C/C++ file implementing the LR state machine. The canonical combination: Flex (lexer) + Bison (parser). Bison's `%expect N` directive suppresses warnings for N known shift-reduce conflicts, usually the dangling-else conflict resolved by convention. ANTLR 4 is the LL(*) alternative: generates Java, Python, C++, TypeScript parsers from a unified grammar syntax.

A grammar that has shift-reduce conflicts is ambiguous and must be rewritten

Shift-reduce conflicts indicate a decision point that requires a resolution rule, not necessarily grammar ambiguity. The dangling-else grammar is unambiguous by intent with the 'prefer shift' convention. Yacc/Bison's %left/%right/%nonassoc directives resolve operator precedence conflicts without rewriting the grammar.

Many natural grammars have shift-reduce conflicts that reflect real design choices (operator associativity, else binding). The conflict is in the parser generation algorithm, not in the language's determinism. Understanding this distinction is essential for practical parser engineering.

Why does LALR(1) have the same number of parse table states as SLR(1) but fewer false conflicts?

Итоги

**LL(1)**: top-down, predicts production from FIRST/FOLLOW sets. LL(1) condition: FIRST sets of alternatives must be disjoint. Left recursion must be eliminated. Implemented as recursive descent.
**LR parsing**: bottom-up, shift tokens then reduce by matching productions. LR(0) items track parser position. Shift-reduce and reduce-reduce conflicts signal grammar decisions.
**SLR vs LALR vs Canonical LR**: SLR uses global FOLLOW (coarse, more false conflicts). LALR uses per-state lookahead (same state count as SLR, fewer conflicts). Canonical LR is the strongest but state-count explodes.
**Real tools**: Yacc/Bison for LALR(1) (PostgreSQL, Linux Kconfig). ANTLR 4 for ALL(*) LL (Kotlin, Python, Twitter). Python 3.9+ uses PEG.

Вопросы для размышления

LL(1) requires eliminating left recursion. LR parsers handle left recursion naturally. Does this mean LR grammars are always easier to write? What trade-offs exist in grammar engineering for each approach?
LALR(1) merging can introduce reduce-reduce conflicts absent in canonical LR(1). Construct a minimal grammar example where LALR(1) fails but LR(1) succeeds. What does this tell about the practical limits of LALR?
Python's switch from LL(1) to PEG reflects a pragmatic choice. What properties of a language design team (maintenance burden, tooling, community) should influence parser algorithm selection?

Связанные уроки

fl-12-cfg — LL and LR parsers operate on context-free grammars
fl-15-cyk — CYK parsing establishes the CFG recognition baseline that LL/LR improve on
comp-11-recursive-descent — Recursive descent is the hand-coded implementation of LL(1) parsing
comp-12-lr-parsing — LR parsing theory from formal languages maps to practical shift-reduce parsers in compilers
comp-14-parser-generators — ANTLR (LL) and Bison (LALR) are production parser generators built on these algorithms