Formal Languages
Pumping Lemma for Regular Languages
You know that a DFA recognizes regular languages. But how do you prove that some language is NOT regular? That no finite automaton in the world can handle it? You cannot enumerate all possible automata - there are infinitely many. What is needed is a "non-regularity detector" - a mathematical tool that catches languages too complex for finite automata. That tool is the Pumping Lemma, and it is based on a simple idea: if an automaton has finite memory, long strings create cycles.
- **Parser design:** the Pumping Lemma proves that HTML, XML, and JSON cannot be correctly parsed with regular expressions - context-free grammars are needed
- **FAANG interviews:** Pumping Lemma problems are standard for positions related to compilers and language technologies
- **Tool selection:** understanding the limits of regex helps decide when grep is sufficient and when a full parser is needed
Предварительные знания
Intuition: Cycles in Automata
Consider a stroll through an old town with 10 intersections. A route that passes through 11 intersections has DEFINITELY revisited one of them. This is the **pigeonhole principle**: if n+1 objects are placed in n boxes, at least one box will contain two.
**Key observation:** A DFA with n states, reading a string of length ≥ n, must visit some state twice. And visiting a state twice means traversing a CYCLE. A cycle can be traversed 0 times, 1 time, 2 times, 100 times... and the automaton will still accept (or reject) the string!
This is what "pumping" means: we take the part of the string that corresponds to the cycle and "pump" it - repeat it as many times as we like. If the language is regular, the automaton must accept all pumped versions.
Remember the analogy: **pumping = running in circles**. If a road has a circular section, it can be traversed any number of times - and the finish is still reached (or not, if it would not have been reached without the loop).
A DFA has 5 states. What is the minimum string length that GUARANTEES a cycle in the processing run?
Formal Statement
Now let us translate the intuition into rigorous mathematics. The statement of the Pumping Lemma looks intimidating because of nested quantifiers, but each quantifier has a clear meaning. Let us examine them one by one.
**Pumping Lemma (for regular languages):** If L is a regular language, then ∃ p ≥ 1 (pumping length) such that ∀ w ∈ L, |w| ≥ p: ∃ a decomposition w = xyz, where: 1. |y| > 0 (the pumpable part is non-empty) 2. |xy| ≤ p (the cycle lies in the first p symbols) 3. ∀ i ≥ 0: xy^i z ∈ L (pumping preserves membership)
Let us examine each quantifier and condition:
| Element | Formula | Meaning | Who chooses |
|---|---|---|---|
| Pumping length | ∃ p ≥ 1 | There exists a threshold p (≈ number of states in the DFA) | The lemma (we do not choose!) |
| Long string | ∀ w ∈ L, |w| ≥ p | We take ANY string from L of length ≥ p | Us (in the proof) |
| Decomposition | ∃ xyz = w | There EXISTS SOME decomposition into three parts | The lemma (we do not choose!) |
| Non-empty cycle | |y| > 0 | The pumpable part y is non-empty | Guarantee |
| Cycle at the start | |xy| ≤ p | The cycle lies within the first p symbols | Guarantee |
| Pumping holds | ∀ i ≥ 0: xy^iz ∈ L | Repeating y any number of times keeps the string in L | Guarantee |
**Critically important:** the Pumping Lemma is a NECESSARY but NOT SUFFICIENT condition for regularity. From "L is regular" it follows that "L is pumpable". But from "L is pumpable" it does NOT follow that "L is regular"! This is a one-directional implication.
The condition |xy| ≤ p is often overlooked, but it is critical. It says: the cycle we found lies **at the beginning** of the string, among the first p symbols. This severely constrains the possible decompositions and makes the lemma more powerful for proofs.
The Pumping Lemma guarantees a decomposition w = xyz with |xy| ≤ p. For the string w = aⁿbⁿ with sufficiently large n, what does y consist of?
Proofs of Non-Regularity
The Pumping Lemma is our "non-regularity detector". To prove that a language L is non-regular, we use **proof by contradiction**: we assume L is regular, apply the lemma, and arrive at a contradiction.
**Proof template (proof by contradiction):** 1. Assume L is regular 2. Then by the Pumping Lemma there exists p ≥ 1 3. Choose a specific string w ∈ L, |w| ≥ p (our choice!) 4. Consider ANY decomposition w = xyz with |y| > 0, |xy| ≤ p 5. Find an i such that xy^iz ∉ L 6. Contradiction with the lemma → L is non-regular
**Example 1: L = {aⁿbⁿ | n ≥ 0}** - a classic of formal language theory.
**Example 2: L = {ww | w ∈ {a,b}*}** - the language of "doubled" strings.
**Example 3: L = {1ⁿ | n is prime}** - unary strings of prime length.
**Strategy for choosing w:** select a string that is perfectly balanced at the edge. For {aⁿbⁿ} take aᵖbᵖ - a perfect balance that breaks under pumping. For {1ⁿ | n prime} take 1^q with prime q - pumping produces composite numbers.
In the proof of non-regularity of L = {aⁿbⁿ} we chose w = aᵖbᵖ and took i = 0. Why does i = 0 work for ANY decomposition?
Limitations of the Pumping Lemma
The Pumping Lemma has limits. There exist non-regular languages that satisfy the Pumping Lemma! This means: the lemma can prove non-regularity, but it cannot prove regularity.
**Pumping Lemma - a one-sided test:** - Does not pump → definitely NON-REGULAR ✓ - Pumps → we know NOTHING (may be regular or may not be) ✗
Consider a tricky example. The language L = {aⁱbʲ | i ≠ j} is non-regular (the count of 'a' does not equal the count of 'b'). But it satisfies the Pumping Lemma!
**The Myhill-Nerode Theorem** is a complete criterion for regularity (both necessary and sufficient). A language L is regular ⟺ the Myhill-Nerode equivalence relation has a finite number of classes.
Why is the Myhill-Nerode theorem stronger? Because it provides a criterion **in both directions**: finite number of classes ⟺ regular. Whereas the Pumping Lemma works in only one direction.
| Method | What it proves | Type of criterion | Difficulty of application |
|---|---|---|---|
| Pumping Lemma | Non-regularity | Necessary (one-sided) | Medium - requires choosing w and i |
| Myhill-Nerode | Regularity OR non-regularity | Necessary and sufficient | High - requires analyzing classes |
| Closure | Non-regularity | Combination with known languages | Medium - requires a good decomposition |
| DFA construction | Regularity | Constructive | Depends on the language |
- Pumping Lemma — A quick test: if it doesn't pump → non-regular. But it may miss non-regular languages that happen to pump. Good for simple cases (aⁿbⁿ, ww, prime numbers).
- Myhill-Nerode — A complete criterion: L is regular ⟺ finite number of classes. More powerful, but harder to apply - requires analyzing an infinite set of strings. Ideal for difficult cases.
**When to use which:** start with the Pumping Lemma - if a suitable w and i can be found, the job is done. If PL does not help (the language pumps), move to Myhill-Nerode. Closure properties are useful when the language is defined through operations on other languages.
If a language satisfies the Pumping Lemma, it is regular
The Pumping Lemma is a NECESSARY condition for regularity. From "L pumps" it does NOT follow that "L is regular". For a complete check the Myhill-Nerode theorem is needed.
The lemma is stated as an implication: "regular → pumps". The converse implication "pumps → regular" is FALSE. Counterexample: L = {aⁱbʲ | i ≠ j} pumps but is non-regular.
Key Ideas
- **Pigeonhole principle + DFA:** a string of length ≥ n in an automaton with n states creates a cycle. The cycle can be "pumped" - repeated 0, 1, 2, ... times
- **Statement:** for regular L there exists p such that any w ∈ L (|w| ≥ p) decomposes as xyz (|y| > 0, |xy| ≤ p), and xy^iz ∈ L for all i ≥ 0
- **Proof by contradiction:** assume regularity, choose a clever string w, show that pumping takes us out of L → contradiction → non-regular
- **Limitations:** PL is a necessary but NOT sufficient condition. For a complete criterion use Myhill-Nerode
Related Topics
The Pumping Lemma connects automata theory with the limits of computability and the practice of parsing:
- DFA Minimization — Pumping length p = number of states in the minimal DFA. Minimization gives the exact p
- Context-Free Grammars — Languages such as {aⁿbⁿ}, shown non-regular by PL, are recognized by CFGs. The PL for CFGs is analogous but with two "pumpable" parts
- NFA → DFA Conversion — Subset construction yields a DFA whose state count determines the pumping length
Вопросы для размышления
- Why is the condition |xy| ≤ p important in the lemma? How would the proofs change without it?
- Can you think of a non-regular language for which the Pumping Lemma fails to prove non-regularity?
- The Pumping Lemma for context-free languages has TWO pumpable parts (uvwxy instead of xyz). Why is one not enough for CFGs?
Связанные уроки
- fl-06-dfa — Pumping lemma proof uses DFA pigeonhole on states
- fl-10-dfa-minimization — Minimal DFA makes the pigeonhole argument tight
- fl-12-cfg — Context-free pumping lemma is an analogous tool for CFLs
- fl-02-chomsky — Hierarchy placement of regular languages provides motivation
- ml-02