Formal Languages

Pumping Lemma for Context-Free Languages

How do you prove that a language is NOT context-free? This resembles proving absence: one cannot simply show that no grammar works. A universal tool is needed - and that tool is the pumping lemma.

  • **Compiler theory**: proving that C is not CFL (due to scope rules) explains why C compilers have a separate symbol table lookup phase instead of pure parsing
  • **Transformers and formal languages**: papers by Hahn et al. (2020), Bhattamishra et al. (2020) show that the attention mechanism theoretically exceeds CFL - partly explaining GPT's syntactic analysis capability
  • **Bioinformatics**: RNA secondary structure prediction is described by CFL grammars; tertiary structure goes beyond CFL. The pumping lemma was applied to prove these boundaries

Pumping Lemma: Structure of Long Strings

The language $\{a^n b^n c^n | n \geq 0\}$ is not context-free. How to prove this? No algorithm can track three counters simultaneously without memory. But that is intuition, not a proof. The pumping lemma provides the formal tool.

The idea: if a CFL is infinite, it contains a 'repeating structure'. In long strings there must be a place for 'pumping' - repeating part of the string any number of times while preserving membership in the language. If no such place exists - the language is not context-free.

Where does the uvwxy split come from? From the parse tree! For a long string the parse tree must be tall. By the path lemma: if height > |V| (number of nonterminals), some nonterminal repeats on the path from root to leaf. This repetition creates the 'pumpable' cycle - exactly like a repeated state in a finite automaton.

Why must a repeated nonterminal appear on the path from root to leaf in a parse tree for sufficiently long strings?

Ogden's Lemma: A Stronger Tool

Ogden's Lemma (1968) strengthens the pumping lemma. One can 'mark' positions in the string and guarantee that the pumpable parts v and x each contain at least one marked position. This makes some proofs considerably easier - in cases where the pumping lemma gives too much freedom in choosing the split.

Example: the language $\{a^i b^j c^k | i = j$ or $j = k\}$. Proving non-CFL with the standard lemma is hard due to the disjunctive condition. Ogden allows focusing on the b-positions, which must participate in pumping, and deriving a contradiction with one of the conditions.

What is the main advantage of Ogden's Lemma over the standard pumping lemma?

Examples of Non-CFL Languages

Three canonical non-CFL languages: $\{a^n b^n c^n\}$ (three counters), $\{ww | w \in \Sigma^*\}$ (string doubling), $\{a^{n^2}\}$ (squares). All require simultaneously tracking dependencies that a context-free grammar cannot express.

ML connection: the attention mechanism in transformers theoretically models languages beyond CFL. Papers from 2019-2022 proved that single-layer transformers with one head are equivalent to Star-Free regular languages. Multi-head attention reaches beyond CFL - this partly explains why transformers recognize syntactic dependencies across arbitrary distances.

Why is the language {ww | w in {a,b}*} (string concatenated with itself) not context-free?

Closure Properties of Context-Free Languages

Context-free languages are closed under: union (L1 U L2), concatenation (L1 L2), Kleene star (L*), homomorphism, reversal. They are NOT closed under: intersection (L1 ∩ L2), complement. This is fundamental for building parsers - grammars cannot be 'intersected' arbitrarily.

Practical consequence: many real programming languages are not context-free. Variables must be declared before use - this is an intersection of two conditions. The solution: separate CFL syntax (parsing) from semantic analysis (type checking, scope resolution). Rust ownership rules require even more expressive formalisms.

The intersection of a CFL with a regular language is CFL. This matters: regular expressions can 'filter' a context-free language without leaving the class. For example, the subset of Python with only simple expressions (no loops) remains CFL. In compilers - lexer (regular) + parser (CFL) cooperate on exactly this basis.

If a programming language is defined by a CFL grammar in BNF, the entire language is context-free

BNF describes only syntax. Semantic constraints (typing, scope, ownership) go beyond CFL and are checked by a separate analyzer

For example, verifying that a variable is declared before use is an intersection of the CFL with a context-dependent condition. By the non-closure of intersection, this is not necessarily CFL. This is exactly why parsing and semantic analysis are separate phases in compilers

Why is the intersection of two CFLs not necessarily CFL?

Related Topics

The pumping lemma completes the theory of context-free languages:

  • Pumping Lemma for Regular Languages — The simpler predecessor of the CFL pumping lemma
  • CYK Algorithm — Parse tree structure used in the proof of the lemma
  • Decidability — Understanding CFL boundaries leads to questions about what is decidable at all

Key Ideas

  • **Pumping Lemma for CFG**: for an infinite CFL any long string s=uvwxy can be pumped (uv^i wx^i y in L). |vwx|<=p, |vx|>=1.
  • **Source of the split**: a repeated nonterminal in the parse tree (pigeonhole on a long path). Analogous to a repeated state in a finite automaton.
  • **Ogden's Lemma**: a strengthening - allows 'marking' positions and guaranteeing they fall in the pumpable parts. Useful for complex disjunctive conditions.
  • **Closure**: CFLs are closed under union, concatenation, Kleene star. NOT closed under intersection and complement. Consequence: syntax != semantics in compilers.

Вопросы для размышления

  • The language {a^n b^n} is CFL. {a^n b^n c^n} is not. Where exactly is the boundary? What determines the 'power' of a grammar?
  • Attention in transformers theoretically exceeds CFL. Does this mean GPT 'understands' syntax better than traditional parsers?
  • Programming languages are deliberately designed to be CFL-parseable. What syntactic restrictions exist precisely because of this requirement?

Связанные уроки

  • fl-11-pumping-lemma — The regular pumping lemma is a simpler analogue for regular languages
  • fl-22-decidability — Understanding CFL boundaries leads to decidability questions
  • alg-22-backtracking — Pumping lemma proofs are structured backtracking against an adversary
  • fl-15-cyk — CYK and parse trees are needed to understand the pumping lemma for CFG
  • ml-02
Pumping Lemma for Context-Free Languages

0

1

Sign In