Formal Languages
What Is a Formal Language
The Pumping Lemma proves that no regex can reliably parse HTML. Not because regex is weak. It is a mathematical theorem about classes of formal languages. That is exactly why browsers have a dedicated HTML parser - and that design decision is correct, not a workaround. The LLM tokenizer (BPE, tiktoken) is a deterministic finite automaton (DFA). The Python compiler is a CFG with an LR(1) parser. POSIX grep and Perl regex are different automaton classes with different performance guarantees. All of this is the Chomsky hierarchy. Today: the foundation.
- **LLM tokenization (tiktoken, BPE):** GPT-4 splits text into tokens via a DFA - a finite automaton over the Unicode alphabet. A vocabulary of 100,277 tokens is a 100,277-symbol output alphabet
- **Python compiler:** lexer (regular languages, Type 3) → parser (context-free, Type 2) → semantic analysis. Three levels of the Chomsky hierarchy inside one `python script.py`
- **POSIX grep vs Perl regex:** POSIX guarantees O(n) matching - a regular language backed by a DFA. Perl backreferences step outside Type 3 and can be exponential - the root cause of ReDoS attacks on web servers
- **Bioinformatics:** BLAST searches genomes via regular patterns over Σ = {A, T, G, C}. Alignment with gap penalties requires a context-free grammar
Alphabet: The Building Blocks of Words
Before writing a regex, before running a compiler, before tokenizing text for GPT - there is one thing to agree on: the symbols. Python: letters, digits, operators, whitespace. DNA: {A, T, G, C}. Machine code: {0, 1}. This set is the **alphabet** - the single parameter from which all of formal language theory flows. Change the alphabet, change everything.
**Alphabet** - a finite, non-empty set of symbols. Denoted **Σ** (sigma). Elements are called **symbols** or **letters**. Examples: Σ = {0, 1} - binary alphabet, Σ = {a, b, c, …, z} - Latin alphabet, Σ = {A, T, G, C} - genetic alphabet.
One constraint: the alphabet must be **finite**. {1, 2, 3, ...} - not an alphabet. {0, 1, 2, ..., 9} - an alphabet. From a two-symbol alphabet the infinite Σ* grows - all possible strings. That is the power of the formalism: a tiny alphabet, an infinitely rich language space.
| Domain | Alphabet Σ | Size |Σ| |
|---|---|---|
| Binary data | {0, 1} | 2 |
| Genetics | {A, T, G, C} | 4 |
| Morse code | {dot, dash, pause} | 3 |
| ASCII | {chars 0–127} | 128 |
| Unicode | {chars 0–1,114,111} | 1,114,112 |
| Parenthetical expressions | {(, )} | 2 |
Noam Chomsky and Formal Languages
In 1956, linguist Noam Chomsky formalized the notion of language as a mathematical object in his paper "Three Models for the Description of Language." He aimed to describe natural languages (English, Russian), but his formalism proved ideal for describing programming languages. The Chomsky hierarchy became the foundation of compiler theory.
Which of the following sets is a valid alphabet?
Strings: Words Over an Alphabet
Alphabet symbols concatenate into **strings** - finite sequences. "011" over {0, 1}: three symbols. "GATTACA" over {A, T, G, C}: seven symbols, and also a film about genetics. The special case is the **empty string** ε: zero symbols. It shows up in every Kleene-star language, which surprises people the first time.
**String (word)** w over alphabet Σ - a finite sequence of symbols from Σ. **Length** |w| - the number of symbols. **Empty string** ε (epsilon) - a string of length 0: |ε| = 0. **Concatenation** - joining: if w = "ab" and v = "cd", then wv = "abcd". Property: wε = εw = w.
**Σ*** is the universe of all strings over Σ, including ε. Countably infinite even for a one-symbol alphabet. For Σ = {0, 1}: Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, ...} - starting from short strings and going forever. This is the space in which all formal languages live, as subsets of Σ*.
| Notation | Meaning | Example (Σ = {a, b}) |
|---|---|---|
| Σ* | All strings (including ε) | {ε, a, b, aa, ab, ba, bb, aaa, ...} |
| Σ⁺ | All non-empty strings | {a, b, aa, ab, ba, bb, aaa, ...} |
| |w| | Length of string w | |aba| = 3 |
| wv | Concatenation of w and v | ab · ba = abba |
| wⁿ | w repeated n times | (ab)³ = ababab |
| wᴿ | Reversal of string w | (abc)ᴿ = cba |
How many strings of length 3 does Σ* contain for alphabet Σ = {a, b, c}?
Language: A Set of Strings
Σ* is the entire universe of strings. A **formal language** is a selection from that universe: which strings pass, which do not. Python source code is a language: some strings are syntactically valid, others are not. An email address is a language. Legal chess moves in PGN notation are a language. The bytes of a valid JPEG file are a language. All of these are subsets of Σ*.
**Formal language** L over alphabet Σ - any subset of Σ*: **L ⊆ Σ***. A language is a set of strings satisfying some property. Examples: L₁ = {0, 1, 00, 11} (finite), L₂ = {0ⁿ1ⁿ : n ≥ 1} = {01, 0011, 000111, …} (infinite), L₃ = {} = ∅ (empty language), L₄ = {ε} (language containing only the empty string).
**∅ ≠ {ε}** - not pedantry, a critical distinction. ∅ is the empty set: zero strings. {ε} is a set with one element: the empty string. Under concatenation: ∅ · L = ∅ (zero strings), {ε} · L = L (unchanged). Exactly like arithmetic: 0 × x = 0, but 1 × x = x. ε is the identity element for string concatenation.
| Language | Description | Finite? | Example strings |
|---|---|---|---|
| {w ∈ {0,1}* : |w| = 3} | All binary strings of length 3 | Yes (8 strings) | 000, 001, 010, ... |
| {aⁿbⁿ : n ≥ 0} | Equal number of a's then b's | No | ε, ab, aabb, aaabbb |
| {w ∈ {a,b}* : w = wᴿ} | Palindromes | No | ε, a, b, aa, aba, ... |
| ∅ | Empty language | Yes (0 strings) | (no strings) |
| {ε} | Only the empty string | Yes (1 string) | ε |
| Σ* | All strings | No | Anything |
**A formal language is a set, not a rule.** The language {01, 0011, 000111, …} can be described by the rule "n zeros followed by n ones," but the language itself is the set of strings. Different rules may describe the same language.
Which statement is true?
Operations on Languages
Languages are sets, so union, intersection, and complement come for free. But two operations are specific to strings: **concatenation** of languages (pair every string of L1 with every string of L2) and the **Kleene star** (zero or more repetitions). These three operations - union, concatenation, star - generate every regular expression that has ever been written. Literally all of them: `[a-z]+`, `\d{3}-\d{4}`, `(foo|bar)*` - just algebra over these three primitives.
**Operations on Languages L₁, L₂ ⊆ Σ*:** 1. **Union** L₁ ∪ L₂ = {w : w ∈ L₁ or w ∈ L₂} 2. **Intersection** L₁ ∩ L₂ = {w : w ∈ L₁ and w ∈ L₂} 3. **Complement** L̄ = Σ* \ L 4. **Concatenation** L₁L₂ = {uv : u ∈ L₁, v ∈ L₂} 5. **Kleene star** L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … = ∪_{n≥0} Lⁿ.
**Kleene star L*** - zero or more repetitions of strings from L in any order. For L = {a, b}: L* = all strings over {a, b}. That is why `.*` in regex means "anything": it is ({all symbols})*. Kleene invented the star in 1956 to describe regular languages mathematically. Today his notation is in every `re.compile()`, every `grep -E`, every `sed` and `awk` on the planet.
| Operation | Notation | Example (L = {0, 11}) | Result |
|---|---|---|---|
| Concatenation | L·L = L² | {0,11}·{0,11} | {00, 011, 110, 1111} |
| Power 0 | L⁰ | For any L | {ε} |
| Power 1 | L¹ | {0, 11} | {0, 11} |
| Kleene star | L* | ∪ Lⁿ for n≥0 | {ε, 0, 11, 00, 011, 110, 1111, ...} |
| Kleene plus | L⁺ | ∪ Lⁿ for n≥1 | {0, 11, 00, 011, 110, 1111, ...} |
**L* always contains ε** (the empty string), even when L does not. By definition: L⁰ = {ε} ⊆ L*. But **∅* = {ε}** - the Kleene star of the empty language is not the empty language, it is {ε}! This is a common exam trap.
A formal language is a programming language. Formal language theory is only useful for building compilers
A formal language is a mathematical abstraction: any set of strings L ⊆ Σ*. A programming language is merely a special case. Formal languages describe DNA sequences, communication protocols, musical patterns, and legal chess moves
A formal language is L ⊆ Σ*. Full stop. No word of 'computer', 'programming', or 'syntax'. Type 3 languages (regular) describe genome patterns in BLAST, IDS/IPS traffic filtering rules, and morphological patterns in NLP tokenizers. Type 2 languages (CFG) model the grammar of natural languages - imperfectly, but better than Type 3. Compilers are simply the most famous application of this mathematical tool, not its definition.
L = {ab, c}. Which strings belong to L²?
Key Takeaways
- **Alphabet Σ** - a finite, non-empty set of symbols. Σ* - all strings over it, an infinite set. Two symbols {0,1} generate an infinite Σ*
- **Formal language L ⊆ Σ*** - any subset of strings. A language is a set, not a rule. Different rules may describe the same language
- **Kleene star L*** = zero or more repetitions from L. Always contains ε. Foundation of regex: `(a|b)*` = ({a}∪{b})*
- **Chomsky hierarchy:** regex = Type 3 (DFA). HTML/balanced parens = Type 2 (CFG, stack). Turing-complete programs = Type 0. Why browsers don't parse HTML with regex - this hierarchy explains it
Related Topics
Formal languages are the foundation for the Chomsky hierarchy, automata theory, and computability theory:
- The Chomsky Hierarchy — Classification of languages by complexity: regular → CF → CS → recursively enumerable
- Grammars — Formal rules for generating the strings of a language
- Regular Expressions — Practical application: union, concatenation, and Kleene star in grep/sed/awk
Вопросы для размышления
- Is the set of all grammatically correct Russian sentences a formal language? If so, what is the alphabet?
- ∅* = {ε}. Explain intuitively why the Kleene star of the empty language is non-empty.
- Return to the opening grep example: the regex ^[0-9]{3}-[0-9]{4}$ describes a formal language. Describe that language using union, concatenation, and Kleene star.
Связанные уроки
- aut-01-fsm — Finite automata are the machines that recognize regular languages (Kleene's theorem)
- comp-01-intro — Compilers use different levels of the Chomsky hierarchy at each phase
- alg-01-big-o — Language recognition complexity: O(n) for DFA, O(n³) for CYK context-free parsing
- dm-01 — Set theory and set operations are the mathematical language of formal languages
- ml-01
- ml-04