Formal Languages

DFA Minimization

You wrote a regular expression to validate email addresses and converted it to a DFA. The result was 47 states. But it turns out 12 are enough. The remaining 35 are "clones" doing the same work. DFA minimization finds and eliminates such duplicates, creating the MOST compact automaton for the given language. And the result is unique - like a fingerprint of the language.

  • **Compilers** (gcc, clang, javac): the lexical analyzer uses a minimized DFA - less memory, faster tokenization
  • **Network filters** (Snort, Suricata): DFAs for intrusion detection are minimized to process traffic in real time
  • **Protocol verification:** comparing two models = DFA minimization + isomorphism check

Предварительные знания

  • NFA → DFA Conversion: Subset Construction

Equivalent States

Imagine a subway map where two stations lead to the same destinations via identical routes. Why keep both? One can be removed by merging the passenger flows. That is exactly what DFA minimization does - it finds **indistinguishable** states and merges them.

**Definition:** Two states q₁ and q₂ are called **equivalent** (q₁ ≡ q₂) if for EVERY string w ∈ Σ* the automaton, starting from q₁, accepts w if and only if, starting from q₂, it also accepts w. In other words - no string in the world can distinguish them.

Think of it this way: you are standing at one intersection in a city. Your friend is at another intersection. If from both intersections any sequence of turns left/right leads to the same outcome (you reach the airport or you don't), then those intersections are **indistinguishable**. The city map can be simplified by merging them into one.

Key insight: checking ALL strings is impossible - there are infinitely many. A systematic approach is needed. Instead of directly verifying equivalence, we will search for **distinguishable** pairs - those for which THERE EXISTS a string that separates them.

Naive enumeration of strings does NOT prove equivalence - it only failed to find a counterexample. For an exact answer, the table-filling algorithm is required, which we will examine next.

TermMeaningAnalogy
Equivalent StatesNo string distinguishes themTwo intersections with identical routes
Distinguishable statesA separating string existsThere is a route where outcomes differ
k-equivalenceIndistinguishable by strings of length ≤ kIdentical on short routes
Full equivalenceIndistinguishable by ANY stringIdentical on ALL routes

States q₁ (accepting) and q₂ (rejecting) in a DFA. Can they be equivalent?

Table-Filling Algorithm (Moore)

The table-filling algorithm is an elegant method for finding ALL distinguishable state pairs. The idea is simple: start with pairs that are **definitely distinguishable** (accept vs reject), and propagate distinguishability backward along transitions.

**Moore Algorithm (table-filling):** 1. Create a table for all pairs (qᵢ, qⱼ) where i < j 2. **Base:** mark as distinguishable all pairs where one state is accepting and the other is not 3. **Iteration:** for each unmarked pair (qᵢ, qⱼ) and each symbol a ∈ Σ: if the pair (δ(qᵢ,a), δ(qⱼ,a)) is already marked as distinguishable → mark (qᵢ, qⱼ) 4. **Repeat** step 3 until no more changes occur 5. Unmarked pairs are **equivalent** states

Let's work through a concrete example. Take the DFA recognizing strings where the number of 'a' symbols is divisible by 3:

**Step 1 (base):** Mark accept/reject pairs: (q0,q1)×, (q0,q2)×, (q1,q3)×, (q2,q3)×, (q3,q4)×, (q0,q4)×

Complexity of the Moore algorithm: O(n² · |Σ|) per iteration, up to n iterations → **O(n³ · |Σ|)** in the worst case. For small automata this is acceptable, but for large ones Hopcroft's algorithm is available.

In the table-filling algorithm, pair (q1, q2) is unmarked. Transition on 'a': δ(q1,a)=q3, δ(q2,a)=q5. Pair (q3,q5) is marked as distinguishable. What will happen?

Hopcroft's Algorithm

Hopcroft's algorithm is the "sprinter" among minimization algorithms. Where table-filling runs in O(n³), Hopcroft finishes in **O(n log n)** - the difference is like bubble sort versus quicksort.

**Idea behind Hopcroft's algorithm:** Instead of finding distinguishable PAIRS, we partition the set of states into EQUIVALENCE CLASSES. We start with the coarse partition {F, Q\F} and progressively refine it by splitting classes based on transitions.

Why do we add the **smaller** of the two to W? This is the key optimization in Hopcroft's algorithm. Each state can be "processed" at most O(log n) times, because each time the class size shrinks by at least half. That is where the O(n log n) comes from.

AlgorithmComplexityApproachWhen to use
Table-filling (Moore)O(n³ · |Σ|)Bottom-up: find distinguishable pairsEducational examples, small automata (< 50 states)
HopcroftO(n · log n · |Σ|)Top-down: split classesProduction use, large automata
BrzozowskiO(2^n) worst caseDouble reversal + determinizationOften fast in practice, but without guarantees

John Hopcroft and the O(n log n) Bound

John Hopcroft published his algorithm in 1971. At that time the best known algorithm ran in O(n²). Hopcroft showed that minimization can be done in O(n log n) - and this result has not been improved in 50+ years. Open question: does an O(n) algorithm exist?

Hopcroft's algorithm is used in every modern compiler to optimize lexical analysis. Every time you compile code - his algorithm is at work.

Hopcroft's algorithm starts with the partition {F, Q\F}. Why this particular initial partition?

Uniqueness of the Minimal DFA

And now - the most beautiful result in all of automata theory. The minimal DFA for a given regular language is **unique** (up to renaming of states). This means: no matter which DFA you start minimizing, the result is always the same.

**Theorem (Myhill-Nerode):** For any regular language L there exists a unique (up to isomorphism) minimal DFA recognizing L. The number of states of this DFA equals the number of equivalence classes of the Myhill-Nerode relation.

What is the Myhill-Nerode relation? Two strings x and y are **Myhill-Nerode equivalent** (x ≡_L y) if for EVERY suffix z: xz ∈ L ⟺ yz ∈ L. In simpler terms - if x and y cannot be distinguished by any "tail".

The uniqueness of the minimal DFA gives us a powerful tool: **checking equivalence of regular languages**. Want to know whether two regular expressions define the same language? Build a DFA for each, minimize both, and compare!

TaskMethodComplexity
Do two regexes define the same language?Minimize DFA + compareO(n log n)
Is DFA already minimal?Minimize + compare sizesO(n log n)
Find canonical formMinimization + BFS numberingO(n log n)
How many states in the minimal DFA?Count Myhill-Nerode classesDepends on the language

**Practical application:** compilers minimize the DFA of the lexical analyzer to reduce the size of the transition table. Fewer states → less memory → faster lexical analysis. Tools like `flex` do this automatically.

DFA minimization can yield different results depending on the algorithm (Moore vs Hopcroft)

The result of minimization is ALWAYS the same - the minimal DFA is unique up to isomorphism. The algorithms differ only in their running speed.

The Myhill-Nerode theorem proves that the equivalence classes are determined by the language, not by the algorithm. Moore and Hopcroft simply find these classes in different ways, but the result is identical.

Two different regular expressions yield minimal DFAs with 4 and 5 states respectively. Can they define the same language?

Key Ideas

  • **Equivalent states** - those that no string can distinguish. They can be merged without loss of information
  • **Table-filling (Moore)** - O(n³): mark distinguishable pairs starting from accept/reject, propagating backward along transitions
  • **Hopcroft** - O(n log n): partition the set of states into classes, refining the partition by transitions. Add the smaller class to the work list
  • **The minimal DFA is unique** - it is the canonical form of the regular language. Two regexes define the same language ⟺ their minimal DFAs are isomorphic

Related Topics

DFA minimization connects to several key topics in formal language theory and compilers:

  • NFA → DFA Conversion — Subset construction can produce an exponentially larger DFA - minimization reduces it back
  • Regular Expressions → Automata — Full pipeline: regex → NFA → DFA → min DFA
  • Pumping Lemma — Pumping length is tied to the number of states - in the minimal DFA this is the exact count

Вопросы для размышления

  • Why can't an NFA be minimized in the same way as a DFA? What breaks?
  • If the minimal DFA is unique, does it follow that the minimal NFA is also unique?
  • The `flex` compiler first builds an NFA, then a DFA, then minimizes. Why not minimize the NFA directly?

Связанные уроки

  • fl-08-nfa-to-dfa — Minimization is applied to DFAs, often after NFA conversion
  • fl-06-dfa — Must understand DFA semantics to reason about equivalence
  • fl-11-pumping-lemma — Minimal DFA characterises the language uniquely - useful for proving non-regularity
  • comp-09-lexer-generators — Tools minimise DFAs to speed up generated lexers
DFA Minimization

0

1

Sign In