Compilers
Context-Free Grammars
How does a computer know that `2 + 3 * 4 = 14`, not 20? And that `if (x) y; else z;` is a single construct, not two? The answer is not in parser code - it is in a formal grammar. BNF was invented in 1960 for ALGOL, and since then every programming language has started with it. A grammar is not documentation; it is a precise mathematical specification of syntax.
- **The C++ standard** includes a complete BNF grammar of the language - the foundation for all compilers (GCC, Clang, MSVC)
- **JSON** is described by a few dozen BNF rules - every JSON parser in the world follows the same grammar
- **SQL** - ISO standard 9075 defines syntax through EBNF; all database systems implement it (with extensions)
- **Language servers (LSP)** - IDE autocompletion is built on the same grammars that compilers use
BNF: a formal language for describing languages
In 1960, John Backus described the syntax of ALGOL 60 - not in English with all its ambiguities, but in a formal notation. Peter Naur refined it for the final report. The result was **BNF** (Backus-Naur Form), a notation that has been the standard for describing programming language syntax for over 60 years.
BNF consists of **production rules**. Each rule has the form `<nonterminal> ::= expression`. Nonterminals (in angle brackets) are abstractions that can be expanded. Terminals are literal characters. The `|` symbol means 'or'. Parsing begins from the start nonterminal.
The grammar structure encodes **operator precedence**. `<term>` is nested inside `<expr>`, so multiplication has higher precedence than addition. This is not accidental - it is how programming language grammars are intentionally designed.
**Formally:** a context-free grammar (CFG) is a 4-tuple G = (V, T, P, S), where V is the set of nonterminals, T is the set of terminals, P is the set of production rules (A -> alpha), and S in V is the start symbol. 'Context-free' means: a rule can be applied to A regardless of what appears to its left or right.
BNF grammar: `<digit> ::= '0' | '1' | ... | '9'`, `<number> ::= <digit> | <number> <digit>`. Which string does this grammar NOT derive?
EBNF: extensions for readability
BNF is powerful but verbose. Describing 'zero or more comma-separated arguments' (`a, b, c, ...`) requires two rules and recursion. **EBNF** (Extended BNF) adds syntactic sugar that makes grammars shorter and closer to readable documentation.
EBNF is syntactic sugar over BNF. Any EBNF grammar can be mechanically transformed into pure BNF. `{ X }` becomes two rules with an empty alternative and recursion. The key point: **expressive power does not change**, only the convenience of notation.
**Standards:** ISO 14977 defines the 'official' EBNF. In practice, every tool (ANTLR, Bison, PEG.js) uses its own dialect. The meaning is the same; the syntax differs slightly.
EBNF: `list = '[' [ item { ',' item } ] ']' ;`. Which string does NOT match this grammar?
Derivations: how a grammar generates strings
A grammar is not just a description - it is a **generator**. Starting from the start nonterminal and repeatedly applying rules, any valid string of the language can be produced. Such a sequence of substitutions is called a **derivation**.
The **language** of a grammar L(G) is the set of all strings that can be derived from the start symbol. This is precisely what the syntax of a programming language means: a program is syntactically correct if and only if a derivation from the start symbol exists for it.
In a leftmost derivation of `a + b * c` from grammar `E -> E+T | T`, `T -> T*F | F`, `F -> id` - which nonterminal is replaced in the first step?
Parse trees and ambiguity
A derivation is a sequence of steps. A **parse tree** (also called a concrete syntax tree) is its visualization: the root is the start symbol, leaves are terminals, and internal nodes are nonterminals. Each edge corresponds to applying one production rule.
An **ambiguous grammar** is one for which a single string has two or more distinct parse trees. This is a disaster for a compiler: given `1 - 2 - 3`, should it compute `(1 - 2) - 3 = -4` or `1 - (2 - 3) = 2`? Associativity is not a mathematical fact - it is a property encoded in grammar structure.
**Inherently ambiguous languages:** some languages are ambiguous by nature - no unambiguous CFG exists for them. Programming languages are specifically designed to be unambiguous: this is a design requirement, not a tool limitation.
If a grammar is ambiguous, the parser simply picks one of the trees and everything works fine.
Ambiguity means different parsers can pick different trees for the same string - the program behaves differently depending on the parser implementation.
Yacc/Bison resolve conflicts with heuristics (shift over reduce, first rule wins). This gives a reproducible result within one parser but does not guarantee the 'correct' result. The right fix is to eliminate ambiguity in the grammar itself.
Grammar `E -> E - E | id` is ambiguous. What property of `1 - 2 - 3` is determined by the shape of the parse tree?
Context-Free Grammars
- BNF: notation `<A> ::= beta` - a nonterminal expands into a sequence of symbols
- EBNF: adds `{ }` (repetition) and `[ ]` (optional), simplifies notation without changing expressive power
- Derivation: a chain of substitutions from the start symbol to a string; leftmost/rightmost determines parser strategy
- Parse tree: a visualization of a derivation; structure encodes precedence and associativity
- Ambiguity: one string with two trees means disaster; resolved by introducing a hierarchy of nonterminals
Related topics
CFG is the theoretical foundation for all practical parsing work.
- Lexer generators — The lexer supplies tokens; the parser builds a tree from tokens according to the CFG
- LL parsers — Top-down parsers implement leftmost derivation over a CFG
- LR parsers — Bottom-up parsers construct a rightmost derivation in reverse order
Вопросы для размышления
- Why can't operator associativity simply be declared in a single grammar rule - why do extra nonterminals have to be introduced?
- The C++ grammar is ambiguous in certain places (the 'most vexing parse'). How do compilers handle that?
- What is the difference between a context-free grammar and a context-sensitive grammar, and which constructs in programming languages require stepping outside CFG?
Связанные уроки
- fl-12-cfg — Formal CFG theory from formal languages applied to parsing
- comp-09-lexer-generators — Tokens feed into the grammar rules
- comp-11-recursive-descent — Recursive descent directly implements LL grammars
- comp-12-lr-parsing — LR parsing works with a broader class of CFGs
- fl-02-chomsky — Chomsky hierarchy places CFGs in context
- plt-25-parser