Formal Languages
Normal Forms: CNF and GNF
Take the arithmetic expression grammar with its ε-rules, long tails, and unit rules. A parsing algorithm looks at it and gives up: too many rule shapes, too many special cases. Chomsky in 1959 and Greibach in 1965 proposed a fix - standardize the form, the way industry standardized electrical outlets. With a predictable shape, one universal algorithm becomes possible - CYK, for example.
- **CYK algorithm** requires CNF as input; bioinformatics uses CYK to parse RNA secondary structures
- **Stochastic CFG in NLP** - PCFG parsers (Charniak, Stanford Parser) internally rely on CNF grammars
- **Theory of computation courses** - the pumping lemma proof for CFLs goes through a CNF derivation tree
Chomsky Normal Form (CNF)
The grammar `S -> aSb | ε` is short but messy: ε-rules, long right-hand sides, terminals and non-terminals interleaved. Parsing algorithms like CYK need a predictable shape. **CNF (Chomsky Normal Form)** allows only two rule forms: `A -> BC` (two non-terminals) or `A -> a` (one terminal). A single ε-rule on the start symbol is allowed when the language contains the empty string.
Every CFG can be rewritten in CNF. This means the class of context-free languages is independent of rule shape. Theoretical value: cleaner proofs (pumping lemma, decidability). Practical value: the CYK algorithm runs on CNF grammars in O(n^3).
Which rule is NOT a valid Chomsky Normal Form rule?
Algorithm: Converting to CNF
Conversion to CNF is a sequence of four steps, each eliminating one kind of impurity. The order matters: ε-rules must be removed before unit rules, otherwise new ε-rules would reappear. Each step preserves the language but changes the rules.
- Introduce a new start symbol S' -> S so that the old start never appears on the right
- Remove ε-rules (except S' -> ε for the empty string): find all nullable non-terminals and duplicate rules with and without them
- Remove unit rules A -> B: replace each unit rule by the productions of B inlined into A
- Break long right-hand sides A -> X1 X2 X3 via auxiliary non-terminals: A -> X1 Y1, Y1 -> X2 X3. Replace terminals occurring inside longer rules by fresh non-terminals
Why does the order of steps in the CNF algorithm matter?
Greibach Normal Form (GNF)
**GNF (Greibach Normal Form, 1965)** is an alternative normalization: every rule has the form `A -> a α`, where `a` is a terminal and `α` is a (possibly empty) string of non-terminals. The first symbol of every right-hand side is a terminal. This gives a strong guarantee: each rule application immediately produces one input symbol.
GNF is well-suited for constructing a PDA from a grammar: a rule `A -> a B C D` translates directly to a PDA transition that reads `a` and pushes `B C D`. The grammar-to-PDA correspondence becomes transparent. GNF also guarantees the absence of left recursion - critical for recursive-descent parsing.
What is common between CNF and GNF, and what is the difference?
Use: CYK and Theoretical Analysis
CNF is the input format for the CYK algorithm (covered in the next lesson), which decides membership in a CFL in O(n^3 * |G|). The algorithm fills a table: cell `[i, j]` records the set of non-terminals deriving substring `s[i..j]`. A clean RHS of form `BC` lets the algorithm consider every rule as a pair of left prefix and right suffix.
The pumping lemma for CFLs (Bar-Hillel, 1961) is formulated through CNF: a long string in the language contains a fragment `uvwxy` where `vwx` is shorter than a constant, and the pumped string `uv^i w x^i y` remains in the language. The proof relies on a derivation tree in a CNF grammar of depth > log |s|.
If a grammar is in CNF, parsing it is always faster
CNF enables the O(n^3) CYK algorithm, but any LL/LR parser on the original grammar (after left-recursion elimination) runs in O(n)
CNF is a tool for theoretical analysis and universal parsing algorithms. Industrial parsers operate on restricted CFG subclasses (LL(k), LR(k)) and run on them linearly. CNF is not an optimization - it is a normalization
Why do industrial compilers NOT use CNF for parsing, even though CYK works on CNF?
CNF is universal, but the price of universality is cubic complexity. Industry trades grammar expressiveness for linear parsing time.
Key Ideas
- **CNF** - rules of form `A -> BC` or `A -> a`; the standard input for the CYK algorithm
- **GNF** - rule `A -> aα`, terminal first; ideal for PDA construction and left-recursion elimination
- **CNF algorithm** - four steps: new start, ε-rules, unit rules, splitting long rules - order matters
- **Use cases** - CNF and GNF live in theory (pumping lemma) and universal parsers (CYK, Earley); industrial compilers prefer LL/LR on the original grammar
Related Topics
Back to the motivation: a universal algorithm demands a standard form. That idea links several upcoming topics:
- Pushdown Automata (PDA) — GNF translates directly into a PDA: rule `A -> aBCD` becomes the transition 'read a, push BCD' - the PDA construction becomes trivial
- CYK Algorithm — CYK requires CNF and runs in O(n^3); CNF is a precondition for using CYK
- Context-Free Grammars — Normal forms prove that the class of CFLs is independent of grammar shape - a foundational theoretical result
Вопросы для размышления
- If CNF makes a grammar less readable while parsing complexity stays O(n^3), why bother with the form beyond proving theorems?
- Could a 'third' normal form exist with properties different from CNF and GNF? What properties would be valuable?
- The pumping lemma is proved via CNF derivation trees. What would the proof look like on the original grammar with long rules?
Связанные уроки
- fl-13-pda — PDA and CFG are equivalent - CFG normalization is needed for both
- fl-15-cyk — CYK works exclusively on CNF grammars
- fl-12-cfg — CFG - the base structure before normalization
- alg-21-dp — CYK on CNF - a classic DP problem on strings
- comp-14-parser-generators