Compilers

Parser Generators: Bison, ANTLR, tree-sitter

Type a new line in VS Code and a small miracle happens: the parser updates the AST of a 50,000-line file in 2 milliseconds and recolors the syntax. Meanwhile GCC compiles a C file using a parser written in a Bison grammar that is 30 years old. Two worlds of parsers with two sets of requirements.

**PostgreSQL** runs on 50,000 lines of Bison grammar where every line is a negotiation with LALR conflicts; one of the project's oldest parsers
**Hibernate HQL** is written in ANTLR; over 20 years the parser has processed billions of queries from Java applications
**GitHub Code Search 2024** uses tree-sitter to build indexes over 200M+ repositories - incrementality is critical at that scale

Yacc and Bison

In 1975 Stephen Johnson at Bell Labs wrote Yacc - Yet Another Compiler-Compiler. The idea: instead of a hand-written C parser, the compiler itself generates LALR(1) tables from a grammar. Ten years later GNU **Bison** appeared - a compatible replacement with extensions. Bison powers the parsers of GCC, PostgreSQL, and Ruby; one `.y` file becomes a ready-made `parser.c`.

A Bison grammar is BNF rules plus semantic actions (C code inside braces). At parse time Bison builds a **parse table** and operates as a stack-driven state machine. Shift/reduce and reduce/reduce conflicts are a familiar pain: the developer rewrites the grammar to eliminate them or declares explicit precedence via `%left`, `%right`, `%nonassoc`.

**LALR(1) vs LR(1)**: LALR(1) merges LR(1) states with the same core, giving fewer states (thousands instead of tens of thousands) but occasionally creating spurious reduce-reduce conflicts on grammars that LR(1) would parse cleanly. Modern Bison supports both.

A shift/reduce conflict appears in a Bison grammar for the if-else rule. What is the standard fix?

ANTLR: LL(*) and strong typing

Bison forces thinking in terms of LALR states and shift/reduce conflicts, which is hard even for veterans. Terence Parr designed **ANTLR** - a top-down parser generator with the **LL(*)** algorithm (adaptive lookahead). LL parsers read like recursive descent, which simplifies debugging and refactoring.

ANTLR powers: the Hibernate HQL parser, IntelliJ plugins for various languages, Twitter Search Query parser, the SQL standard parser in Presto/Trino. It generates Java, C#, Python, Go, and more. The key feature is **adaptive prediction**: each alternative chooses exactly as much lookahead as needed (any number of symbols, unlike LL(1)).

**Visitor vs Listener pattern**: ANTLR generates both interfaces. The Listener walks the AST automatically through `walk()` and is convenient for side effects. The Visitor returns values and is used for interpretation/transformation.

In ANTLR the grammar `expr: expr '+' expr | INT` compiles, while an equivalent rule fails in Bison. Why?

tree-sitter: parsing for IDEs

Inside an IDE the parser runs on **every keystroke**. A regular LALR parser cannot keep up: reparsing a 10,000-line file takes hundreds of milliseconds. Atom/GitHub released **tree-sitter** in 2018 - an incremental parser that updates the AST in 1-5 ms by reparsing only the edited region.

tree-sitter uses a **GLR (Generalized LR)** parser with error recovery: it lives with syntax errors (essential for IDEs where code is always invalid mid-edit). A single AST contains both valid nodes and error nodes. It is used in Neovim, GitHub code search, Helix, and many language servers. Grammars are written in JavaScript (a DSL).

**tree-sitter in Neovim**: while a file is edited, only the affected window is reparsed; the rest of the AST is reused. This delivers highlighting and folding without lag even on 100k-line files.

Why does tree-sitter fit IDEs while Bison or ANTLR does not?

Error recovery in parsers

A C compiler from the 1980s reports one error at a time - and that is painful: fix, recompile, see the next one, half an hour later everything works. Modern compilers (rustc, clang, TypeScript) show **dozens of errors in a single pass**. That requires **error recovery**: after an error, the parser does not crash but finds a safe synchronization point and continues.

Classic strategies: **panic mode** (skip tokens until a marker like `;` or `}`), **phrase-level recovery** (insert missing tokens), **error productions** (grammar rules modeling common mistakes, e.g. `expr: expr '+' /* missing */`). tree-sitter and Roslyn (C#) use ML-driven recovery - picking a fix from a probabilistic model.

**Why this matters for language servers**: VS Code and IntelliJ underline errors in real time. If the parser crashes on the first error, the rest of the code loses highlighting and autocomplete. Error recovery delivers the UX of Word with grammar checking.

Bison is obsolete, tree-sitter is better for everything.

Bison is optimal for compilers (single pass, clean input), tree-sitter is for IDEs (incrementality plus error recovery); ANTLR is the general-purpose middle ground with strong tooling.

Each tool is optimized for a different workload profile. Tree-sitter trades cold-parse performance for reparse speed; Bison does the opposite.

When parsing `let x = ; let y = 5;` the compiler should report both errors. Which approach makes this possible?

Key Ideas

**Bison (LALR)** is the time-tested choice for compilers: single pass, efficient tables, but shift/reduce conflicts require experience
**ANTLR (LL(*))** is easier to work with: recursive descent reads like code, left recursion is rewritten automatically, Visitor/Listener built in
**tree-sitter (GLR + incremental)** is optimized for IDEs: incremental reparse in milliseconds plus error recovery with ERROR nodes in the AST
**Error recovery** is not optional in modern compilers: panic mode, phrase-level recovery, and error productions deliver the 'all errors at once' UX
Parser choice = workload choice: single pass over clean input vs thousands of reparses over a constantly changing file

Вопросы для размышления

If PostgreSQL were rewritten from scratch today, would the parser still use Bison? Compatibility or genuinely the best fit for a DBMS?
tree-sitter trades AST purity (ERROR nodes) for speed. In what scenarios is this trade-off unacceptable?
Why do compilers like rustc and clang use hand-written parsers instead of generators? Control over error messages, or something else?

Связанные уроки

comp-13-peg-packrat — PEG is one of the formal bases for generators like pest
comp-12-lr-parsing — Bison generates LALR parsers - LR must be understood first
comp-11-recursive-descent — ANTLR builds LL(*) recursive descent
comp-15-pratt-parser — Pratt is the handwritten alternative - same trade-offs
alg-01-big-o — Parse time depends on grammar class - Big O context
fl-12-cfg