Compilers

Context-Free Grammars

How does a computer know that `2 + 3 * 4 = 14`, not 20? And that `if (x) y; else z;` is a single construct, not two? The answer is not in parser code - it is in a formal grammar. BNF was invented in 1960 for ALGOL, and since then every programming language has started with it. A grammar is not documentation; it is a precise mathematical specification of syntax.

  • **The C++ standard** includes a complete BNF grammar of the language - the foundation for all compilers (GCC, Clang, MSVC)
  • **JSON** is described by a few dozen BNF rules - every JSON parser in the world follows the same grammar
  • **SQL** - ISO standard 9075 defines syntax through EBNF; all database systems implement it (with extensions)
  • **Language servers (LSP)** - IDE autocompletion is built on the same grammars that compilers use

BNF: a formal language for describing languages

In 1960, John Backus described the syntax of ALGOL 60 - not in English with all its ambiguities, but in a formal notation. Peter Naur refined it for the final report. The result was **BNF** (Backus-Naur Form), a notation that has been the standard for describing programming language syntax for over 60 years.

BNF consists of **production rules**. Each rule has the form `<nonterminal> ::= expression`. Nonterminals (in angle brackets) are abstractions that can be expanded. Terminals are literal characters. The `|` symbol means 'or'. Parsing begins from the start nonterminal.

The grammar structure encodes **operator precedence**. `<term>` is nested inside `<expr>`, so multiplication has higher precedence than addition. This is not accidental - it is how programming language grammars are intentionally designed.

**Formally:** a context-free grammar (CFG) is a 4-tuple G = (V, T, P, S), where V is the set of nonterminals, T is the set of terminals, P is the set of production rules (A -> alpha), and S in V is the start symbol. 'Context-free' means: a rule can be applied to A regardless of what appears to its left or right.

BNF grammar: `<digit> ::= '0' | '1' | ... | '9'`, `<number> ::= <digit> | <number> <digit>`. Which string does this grammar NOT derive?

EBNF: extensions for readability

BNF is powerful but verbose. Describing 'zero or more comma-separated arguments' (`a, b, c, ...`) requires two rules and recursion. **EBNF** (Extended BNF) adds syntactic sugar that makes grammars shorter and closer to readable documentation.

EBNF is syntactic sugar over BNF. Any EBNF grammar can be mechanically transformed into pure BNF. `{ X }` becomes two rules with an empty alternative and recursion. The key point: **expressive power does not change**, only the convenience of notation.

**Standards:** ISO 14977 defines the 'official' EBNF. In practice, every tool (ANTLR, Bison, PEG.js) uses its own dialect. The meaning is the same; the syntax differs slightly.

EBNF: `list = '[' [ item { ',' item } ] ']' ;`. Which string does NOT match this grammar?

Derivations: how a grammar generates strings

A grammar is not just a description - it is a **generator**. Starting from the start nonterminal and repeatedly applying rules, any valid string of the language can be produced. Such a sequence of substitutions is called a **derivation**.

The **language** of a grammar L(G) is the set of all strings that can be derived from the start symbol. This is precisely what the syntax of a programming language means: a program is syntactically correct if and only if a derivation from the start symbol exists for it.

In a leftmost derivation of `a + b * c` from grammar `E -> E+T | T`, `T -> T*F | F`, `F -> id` - which nonterminal is replaced in the first step?

Parse trees and ambiguity

A derivation is a sequence of steps. A **parse tree** (also called a concrete syntax tree) is its visualization: the root is the start symbol, leaves are terminals, and internal nodes are nonterminals. Each edge corresponds to applying one production rule.

An **ambiguous grammar** is one for which a single string has two or more distinct parse trees. This is a disaster for a compiler: given `1 - 2 - 3`, should it compute `(1 - 2) - 3 = -4` or `1 - (2 - 3) = 2`? Associativity is not a mathematical fact - it is a property encoded in grammar structure.

**Inherently ambiguous languages:** some languages are ambiguous by nature - no unambiguous CFG exists for them. Programming languages are specifically designed to be unambiguous: this is a design requirement, not a tool limitation.

If a grammar is ambiguous, the parser simply picks one of the trees and everything works fine.

Ambiguity means different parsers can pick different trees for the same string - the program behaves differently depending on the parser implementation.

Yacc/Bison resolve conflicts with heuristics (shift over reduce, first rule wins). This gives a reproducible result within one parser but does not guarantee the 'correct' result. The right fix is to eliminate ambiguity in the grammar itself.

Grammar `E -> E - E | id` is ambiguous. What property of `1 - 2 - 3` is determined by the shape of the parse tree?

Context-Free Grammars

  • BNF: notation `<A> ::= beta` - a nonterminal expands into a sequence of symbols
  • EBNF: adds `{ }` (repetition) and `[ ]` (optional), simplifies notation without changing expressive power
  • Derivation: a chain of substitutions from the start symbol to a string; leftmost/rightmost determines parser strategy
  • Parse tree: a visualization of a derivation; structure encodes precedence and associativity
  • Ambiguity: one string with two trees means disaster; resolved by introducing a hierarchy of nonterminals

Related topics

CFG is the theoretical foundation for all practical parsing work.

  • Lexer generators — The lexer supplies tokens; the parser builds a tree from tokens according to the CFG
  • LL parsers — Top-down parsers implement leftmost derivation over a CFG
  • LR parsers — Bottom-up parsers construct a rightmost derivation in reverse order

Вопросы для размышления

  • Why can't operator associativity simply be declared in a single grammar rule - why do extra nonterminals have to be introduced?
  • The C++ grammar is ambiguous in certain places (the 'most vexing parse'). How do compilers handle that?
  • What is the difference between a context-free grammar and a context-sensitive grammar, and which constructs in programming languages require stepping outside CFG?

Связанные уроки

  • fl-12-cfg — Formal CFG theory from formal languages applied to parsing
  • comp-09-lexer-generators — Tokens feed into the grammar rules
  • comp-11-recursive-descent — Recursive descent directly implements LL grammars
  • comp-12-lr-parsing — LR parsing works with a broader class of CFGs
  • fl-02-chomsky — Chomsky hierarchy places CFGs in context
  • plt-25-parser
Context-Free Grammars

0

1

Sign In