Formal Languages

Context-Free Grammars (CFG)

Every time you write `if (x > 0) { return x * 2; }`, the compiler parses that string using the rules of a grammar. That grammar is context-free. Without CFG there would be no programming language, no calculator, and no HTML parser.

  • **Compilers and interpreters** - parsing code in C, Python, JavaScript relies on CFG
  • **HTML/XML parsers** - nested tags `<div><p>...</p></div>` are described by a context-free grammar
  • **Mathematical expressions** - calculators use CFG for correct operator precedence
  • **NLP (natural language processing)** - parsing sentences: subject + predicate + object

Предварительные знания

  • Regex ↔ Automata: Kleene's Theorem

What Is a Context-Free Grammar?

**Regular expressions are powerless** when it comes to checking nested brackets. The string `((()))` is valid, `(()` is not. Regex cannot "count" nesting depth. For such languages a more powerful tool is needed - a **context-free grammar (CFG)**.

A **CFG** is a set of substitution rules that describe how to generate strings of a language from the start symbol. Unlike regular grammars, the right-hand side of a rule may contain **any** combination of terminals and non-terminals.

Formally, **G = (V, Σ, R, S)** is a 4-tuple, where:

ComponentNotationWhat it is
Variables (non-terminals)V

0

1

Sign In

Symbols that are replaced by rules: S, A, B...
TerminalsΣFinal symbols of the language: a, b, +, (, ) ...
Rules (productions)RA set of substitutions of the form A → α
Start symbolSThe non-terminal from which derivation begins

Consider the classic example - the language **{aⁿbⁿ | n ≥ 0}**, which regex cannot recognize:

**Why does this work?** Each application of the rule `S → aSb` adds one `a` on the left and one `b` on the right. The stack of `a` and `b` is always balanced - and that is precisely what regex cannot check.

  • Regular grammar — Right-hand side: only xB or x (one non-terminal on the right). Equivalent to a finite automaton. Cannot count nesting.
  • Context-free grammar — Right-hand side: ANY string from V and Σ (for example, aSb). Can count, nest, mirror. Strictly more powerful.

Which language CANNOT be described by a regular grammar, but CAN be described by a CFG?

Derivations: Step by Step

A **derivation** is a sequence of rule applications that transforms the start symbol into a final string of terminals. At each step we take one non-terminal and replace it according to one of the rules.

Let us take the grammar of arithmetic expressions - the very one used by real compilers:

Let us derive the string `id + id * id` (analogous to `a + b * c`):

**Leftmost derivation** - at each step replace the leftmost non-terminal. **Rightmost derivation** - the rightmost. For an unambiguous grammar both lead to the same parse tree.

The order of steps differs, but the result is the same. Let us write a string generator for the grammar:

Compilers use **rightmost derivation** in reverse - this is the foundation of LR parsing (shift-reduce). Leftmost derivation is used in LL parsers (recursive descent).

In the leftmost derivation of `id * id` using the grammar E→T, T→T*F|F, F→id - what is the SECOND step?

Parse Trees

A **parse tree** is a visualization of a derivation. It shows which rules were applied and in what structure. The root is the start symbol, leaves are terminals, and internal nodes are non-terminals.

The parse tree eliminates the ambiguity of step ordering: leftmost and rightmost derivations yield **the same tree**, provided the grammar is unambiguous. The tree captures structure, not ordering.

Let us construct the parse tree for `id + id * id` using grammar G₂:

Notice: the tree **encodes operator precedence**. The multiplication `id * id` sits deeper in the tree (computed first), while addition is closer to the root.

In a compiler the parse tree is transformed into an **AST (Abstract Syntax Tree)** - a simplified version without "extra" nodes. For example, the E, T, F nodes are collapsed, leaving only operators and operands.

What are the LEAVES of a parse tree?

Grammar Ambiguity

A **grammar is ambiguous** if there exists at least one string that has **two or more distinct parse trees**. This is a disaster for a compiler - it does not know how to interpret the expression.

Consider the "naive" expression grammar where everything is collapsed into one non-terminal:

**Why is this dangerous?** For `2 + 3 * 4`, tree 1 yields (2+3)*4 = 20, while tree 2 yields 2+(3*4) = 14. A compiler with an ambiguous grammar will generate unpredictable code!

The solution is to **rewrite the grammar** so that precedence and associativity are "baked into" the rule structure:

ProblemCauseSolution
a + b * c - two treesNo operator precedenceDifferent non-terminal levels: E for +, T for *
a + b + c - two treesNo associativityLeft recursion E → E + T (left-associative)
if-then-else with nested ifDangling elseSplit into matched/unmatched if

Dangling Else - A Bug That Became a Legend

The "dangling else" problem was first described during the development of ALGOL 60 in the 1960s. The statement `if a then if b then s1 else s2` - which if does the else belong to? This ambiguity tormented compiler developers for decades. Languages like Python solved it radically - with mandatory indentation.

**Inherently ambiguous languages** - there exist context-free languages for which **no** unambiguous grammar exists. Example: {aⁿbⁿcᵐ} ∪ {aⁿbᵐcᵐ}. Any CFG for this language will be ambiguous on strings of the form aⁿbⁿcⁿ.

If a grammar is ambiguous, the language is also ambiguous

A language can be unambiguous even if a specific grammar is not. Usually the grammar can be rewritten to eliminate ambiguity

Ambiguity is a property of the grammar, not the language. The exception is inherently ambiguous languages, for which no CFG will be unambiguous

The grammar E → E+E | E*E | id is ambiguous. Which approach SOLVES the precedence problem?

Key Ideas

  • **CFG = (V, Σ, R, S)** - variables, terminals, substitution rules, start symbol
  • **CFG is strictly more powerful** than regular grammars - it can describe {aⁿbⁿ}, nested brackets, and other non-regular languages
  • **Derivation** - a sequence of rule applications: S ⟹ ... ⟹ string
  • **Parse tree** - a visualization of the derivation structure, encoding operator precedence
  • **Ambiguous grammar** - a string has ≥2 parse trees; usually fixed by a hierarchy of non-terminals

Connections to Other Topics

CFG is a bridge between regular languages and full computability:

  • Regular Expressions and Automata — CFG is the next level in the Chomsky hierarchy above regular languages
  • Pushdown Automata (PDA) — PDA is a computational model equivalent in power to CFG
  • CYK Algorithm — CYK is a parsing algorithm for CFG in O(n³), using dynamic programming

Вопросы для размышления

  • Why does the non-terminal hierarchy E → T → F solve the operator precedence problem? What would happen if you added an exponentiation operator - what level would it occupy?
  • The language {aⁿbⁿcⁿ | n ≥ 0} is NOT context-free. Intuitively - why can a CFG not "count" three groups of symbols simultaneously?
  • HTML allows nested tags: <div><div></div></div>. Why is CFG needed to validate HTML rather than regex?

Связанные уроки

  • fl-11-pumping-lemma — Regular languages contrast motivates CFG power
  • fl-02-chomsky — CFGs sit at level 2 of the Chomsky hierarchy
  • comp-10-cfg — Compilers use CFGs for parser specifications
  • fl-13-pda — PDAs are the machine equivalent of CFGs
  • fl-16-pumping-cf — Context-free pumping lemma proves non-context-free languages
Context-Free Grammars (CFG)