Compilers

PEG and Packrat Parsing

For years the Python parser suffered from LL(1) limitations of the pgen generator: f-strings were parsed with a separate regular expression inside the lexer, expression-specific constructs needed hacks, and pattern matching seemed impossible. In 2020 Guido van Rossum, Pablo Galindo, and Lysandros Nikolaou rewrote the Python core to use a PEG parser (PEP 617). Pattern matching arrived in Python 3.10 a year later - the grammar finally allowed it. The story of PEG is a story of how switching the parsing model from CFG to deterministic PEG unlocked language features that had been out of reach for 20 years.

  • **Python 3.9+ (PEP 617):** the language core is parsed by a PEG parser with memoization - that is precisely what enabled structural pattern matching
  • **Peggy / PEG.js:** PEG parser generators for JavaScript, used in Babel, Markdown-It, and hundreds of DSL tools
  • **lark / pyparsing / parsimmon:** PEG composition libraries for building DSLs, configuration languages, and query parsers

PEG: Grammars Without Ambiguity

Chomsky's context-free grammars (CFGs) describe a set of strings - but not how to parse them. The classical arithmetic grammar allows 1+2*3 to be parsed as (1+2)*3 or 1+(2*3); the CFG says 'both are valid' even though the meaning differs. To obtain a parser the engineer adds precedences, resolves shift/reduce conflicts, and fights ambiguity. In 2004 Bryan Ford (MIT) proposed Parsing Expression Grammars (PEG) - a family of grammars in which ambiguity is impossible by construction. PEG describes not a set of strings but a parsing algorithm. Every rule is a deterministic function that returns either success with a match length or failure.

PEG operators: e1 e2 - sequence; e1 / e2 - ordered choice (try e1, on failure try e2); e* - zero or more; e+ - one or more; e? - optional; &e - lookahead without consumption; !e - negative lookahead. These operators sit in a fixed hierarchy and guarantee deterministic parsing. Modern PEG-based parsers: Python's PEP 617 (Python 3.9+) fully uses a PEG parser; ANTLR 4 uses a PEG-like scheme; libraries include lark, pyparsing, peg.js, and parsimmon.

What fundamentally distinguishes PEG from a context-free grammar (CFG)?

Ordered Choice: e1 / e2 Instead of e1 | e2

The principal difference between PEG and CFG is the / operator in place of |. In CFG alternatives are commutative: A | B and B | A are equivalent. In PEG the / operation is ordered: the parser tries the left alternative first, and only on its failure tries the right. That delivers strict priority and removes ambiguity - but it forces grammar authors to think about order. A classic trap: A <- 'if' Expr 'then' Stmt / 'if' Expr 'then' Stmt 'else' Stmt - the parser always picks the short rule and never reaches the else branch. The fix: longer alternative first.

Negative lookahead (!e) is the second key feature of PEG, absent in CFG. It allows expressing 'if X follows, do not apply this rule' without state-counting or analysis. Example: an identifier is a sequence of letters that is not a keyword: Ident <- !Keyword [a-zA-Z]+. Without negative lookahead such checks need separate post-filters. With it the grammar becomes self-contained and readable.

In a PEG grammar, the order of alternatives in e1 / e2:

Memoization: The Key to Linearity

A naive PEG parser has exponential worst-case complexity. The cause is backtracking: a grammar like A <- a A / a on failure enumerates every way to parse the same position. The remedy has been known since the 1970s: memoization. If parsing rule P at position i has already been computed (success/failure, new position, AST), a repeat call returns the cache. The number of unique (rule, position) pairs is exactly N*K for N input symbols and K rules - so the whole parse is O(N*K). With a fixed grammar this is O(N): linear in the input size.

Memoization spends memory: a table of size N*K. For most languages that is tolerable (thousands of rules times megabytes of text equals tens of megabytes). Real-world Packrat parsers use lazy memoization: caching is applied only to rules where backtracking is possible. ANTLR 4 and Python's PEP 617 use selective memoization - the cache exists only for potentially ambiguous rules, lowering overhead. Left recursion is not supported by classical Packrat and requires special handling (Warth et al., 2008).

Why does memoization give a PEG parser linear time in the input?

The Packrat Parser: PEG + Memoization in Practice

The Packrat parser is the name Bryan Ford coined for a PEG with full memoization. Literally a 'packrat' - by analogy with the animal that stashes everything it finds in its burrow. In practice a Packrat parser is a set of mutually recursive functions (one per rule), each wrapped in memo. It builds an AST in linear time, guarantees the absence of ambiguity, and is easily extended with new rules. Drawbacks: higher memory consumption and poor handling of left recursion (the grammar A <- A '+' B / B loops forever without special support).

Modern Packrat implementations: PEG.js / Peggy for JavaScript, lark for Python, parsimmon for functional composition, and ohm.js (Resilient Software) with an interactive grammar editor. Python's PEP 617 is an industrial proof that PEG scales: the Python 3.9+ core is parsed by the new PEG parser, replacing pgen 1.0. The grammar became simpler, diagnostics more precise, and f-strings and pattern matching are supported without hacks - a feature only PEG made possible.

PEG is just nicer-looking BNF, and any BNF grammar translates directly to PEG

PEG has different semantics: ordered choice instead of commutative alternation. A direct BNF-to-PEG translation often changes the language accepted by the grammar

A classic example is dangling-else: BNF accepts both readings (ambiguous), while PEG rigidly commits to whichever alternative comes first. Translating a grammar from BNF to PEG requires deliberate ordering of alternatives by priority and verification that the language remains the same

The grammar A <- A '+' B / B in a classical Packrat parser leads to:

Key Takeaways

  • **PEG defines a parsing algorithm, not a set of strings** - ambiguity is impossible by construction.
  • **Ordered choice e1 / e2** replaces commutative | in CFG: the order of alternatives is critical; the longest usually goes first.
  • **Memoization makes PEG linear:** each (rule, position) pair is processed exactly once - O(N*K) in the input.
  • **The Packrat parser is PEG + memoization** in real code; the canonical industrial example is PEP 617 in the Python core.
  • **Left recursion is unsupported by classical Packrat** - special extensions (Warth seed parsing) are required.

Related Topics

PEG stands alongside traditional LL/LR approaches to syntactic analysis.

  • LR Parsing — LR works on CFGs and resolves shift/reduce conflicts; PEG is deterministic by construction
  • Formal Grammars — Different expressive classes: CFG vs PEG - neither a subset nor a superset of the other

Вопросы для размышления

  • Python switched from pgen to PEG in 2020 and gained pattern matching. Which other language constructs were out of reach under LL(1) parsing and now become possible with PEG?
  • Left recursion needs Warth-style seed parsing. What trade-offs does this introduce compared with the automatic support found in LR parsers?
  • In PEG the order of alternatives defines the language. Which practices help a team catch cases where e1 'eats' a prefix before reaching the right e2 during early grammar development?

Связанные уроки

  • comp-12-lr-parsing — LR parsing - context for understanding why PEG was invented
  • comp-11-recursive-descent — PEG is recursive descent with memoization
  • comp-14-parser-generators — Packrat underlies PEG-based generators like pest and PEG.js
  • alg-21-dp — Packrat memoization is tabular DP on the input string
  • fl-03-grammars
PEG and Packrat Parsing

0

1

Sign In