Formal Languages

What Is a Formal Language

The Pumping Lemma proves that no regex can reliably parse HTML. Not because regex is weak. It is a mathematical theorem about classes of formal languages. That is exactly why browsers have a dedicated HTML parser - and that design decision is correct, not a workaround. The LLM tokenizer (BPE, tiktoken) is a deterministic finite automaton (DFA). The Python compiler is a CFG with an LR(1) parser. POSIX grep and Perl regex are different automaton classes with different performance guarantees. All of this is the Chomsky hierarchy. Today: the foundation.

**LLM tokenization (tiktoken, BPE):** GPT-4 splits text into tokens via a DFA - a finite automaton over the Unicode alphabet. A vocabulary of 100,277 tokens is a 100,277-symbol output alphabet
**Python compiler:** lexer (regular languages, Type 3) → parser (context-free, Type 2) → semantic analysis. Three levels of the Chomsky hierarchy inside one `python script.py`
**POSIX grep vs Perl regex:** POSIX guarantees O(n) matching - a regular language backed by a DFA. Perl backreferences step outside Type 3 and can be exponential - the root cause of ReDoS attacks on web servers
**Bioinformatics:** BLAST searches genomes via regular patterns over Σ = {A, T, G, C}. Alignment with gap penalties requires a context-free grammar

Alphabet: The Building Blocks of Words

Before writing a regex, before running a compiler, before tokenizing text for GPT - there is one thing to agree on: the symbols. Python: letters, digits, operators, whitespace. DNA: {A, T, G, C}. Machine code: {0, 1}. This set is the **alphabet** - the single parameter from which all of formal language theory flows. Change the alphabet, change everything.

**Alphabet** - a finite, non-empty set of symbols. Denoted **Σ** (sigma). Elements are called **symbols** or **letters**. Examples: Σ = {0, 1} - binary alphabet, Σ = {a, b, c, …, z} - Latin alphabet, Σ = {A, T, G, C} - genetic alphabet.

One constraint: the alphabet must be **finite**. {1, 2, 3, ...} - not an alphabet. {0, 1, 2, ..., 9} - an alphabet. From a two-symbol alphabet the infinite Σ* grows - all possible strings. That is the power of the formalism: a tiny alphabet, an infinitely rich language space.

Domain	Alphabet Σ	Size \|Σ\|
Binary data	{0, 1}	2
Genetics	{A, T, G, C}	4
Morse code	{dot, dash, pause}	3
ASCII	{chars 0–127}	128
Unicode	{chars 0–1,114,111}	1,114,112
Parenthetical expressions	{(, )}	2

Noam Chomsky and Formal Languages

In 1956, linguist Noam Chomsky formalized the notion of language as a mathematical object in his paper "Three Models for the Description of Language." He aimed to describe natural languages (English, Russian), but his formalism proved ideal for describing programming languages. The Chomsky hierarchy became the foundation of compiler theory.

Which of the following sets is a valid alphabet?

Strings: Words Over an Alphabet

Alphabet symbols concatenate into **strings** - finite sequences. "011" over {0, 1}: three symbols. "GATTACA" over {A, T, G, C}: seven symbols, and also a film about genetics. The special case is the **empty string** ε: zero symbols. It shows up in every Kleene-star language, which surprises people the first time.

**String (word)** w over alphabet Σ - a finite sequence of symbols from Σ. **Length** |w| - the number of symbols. **Empty string** ε (epsilon) - a string of length 0: |ε| = 0. **Concatenation** - joining: if w = "ab" and v = "cd", then wv = "abcd". Property: wε = εw = w.

**Σ*** is the universe of all strings over Σ, including ε. Countably infinite even for a one-symbol alphabet. For Σ = {0, 1}: Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, ...} - starting from short strings and going forever. This is the space in which all formal languages live, as subsets of Σ*.

Notation	Meaning	Example (Σ = {a, b})
Σ*	All strings (including ε)	{ε, a, b, aa, ab, ba, bb, aaa, ...}
Σ⁺	All non-empty strings	{a, b, aa, ab, ba, bb, aaa, ...}
\|w\|	Length of string w	\|aba\| = 3
wv	Concatenation of w and v	ab · ba = abba
wⁿ	w repeated n times	(ab)³ = ababab
wᴿ	Reversal of string w	(abc)ᴿ = cba

How many strings of length 3 does Σ* contain for alphabet Σ = {a, b, c}?

Language: A Set of Strings

Σ* is the entire universe of strings. A **formal language** is a selection from that universe: which strings pass, which do not. Python source code is a language: some strings are syntactically valid, others are not. An email address is a language. Legal chess moves in PGN notation are a language. The bytes of a valid JPEG file are a language. All of these are subsets of Σ*.

**Formal language** L over alphabet Σ - any subset of Σ*: **L ⊆ Σ***. A language is a set of strings satisfying some property. Examples: L₁ = {0, 1, 00, 11} (finite), L₂ = {0ⁿ1ⁿ : n ≥ 1} = {01, 0011, 000111, …} (infinite), L₃ = {} = ∅ (empty language), L₄ = {ε} (language containing only the empty string).

**∅ ≠ {ε}** - not pedantry, a critical distinction. ∅ is the empty set: zero strings. {ε} is a set with one element: the empty string. Under concatenation: ∅ · L = ∅ (zero strings), {ε} · L = L (unchanged). Exactly like arithmetic: 0 × x = 0, but 1 × x = x. ε is the identity element for string concatenation.

Language	Description	Finite?	Example strings
{w ∈ {0,1}* : \|w\| = 3}	All binary strings of length 3	Yes (8 strings)	000, 001, 010, ...
{aⁿbⁿ : n ≥ 0}	Equal number of a's then b's	No	ε, ab, aabb, aaabbb
{w ∈ {a,b}* : w = wᴿ}	Palindromes	No	ε, a, b, aa, aba, ...
∅	Empty language	Yes (0 strings)	(no strings)
{ε}	Only the empty string	Yes (1 string)	ε
Σ*	All strings	No	Anything

**A formal language is a set, not a rule.** The language {01, 0011, 000111, …} can be described by the rule "n zeros followed by n ones," but the language itself is the set of strings. Different rules may describe the same language.

Which statement is true?

Operations on Languages

Languages are sets, so union, intersection, and complement come for free. But two operations are specific to strings: **concatenation** of languages (pair every string of L1 with every string of L2) and the **Kleene star** (zero or more repetitions). These three operations - union, concatenation, star - generate every regular expression that has ever been written. Literally all of them: `[a-z]+`, `\d{3}-\d{4}`, `(foo|bar)*` - just algebra over these three primitives.

**Operations on Languages L₁, L₂ ⊆ Σ*:** 1. **Union** L₁ ∪ L₂ = {w : w ∈ L₁ or w ∈ L₂} 2. **Intersection** L₁ ∩ L₂ = {w : w ∈ L₁ and w ∈ L₂} 3. **Complement** L̄ = Σ* \ L 4. **Concatenation** L₁L₂ = {uv : u ∈ L₁, v ∈ L₂} 5. **Kleene star** L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … = ∪_{n≥0} Lⁿ.

**Kleene star L*** - zero or more repetitions of strings from L in any order. For L = {a, b}: L* = all strings over {a, b}. That is why `.*` in regex means "anything": it is ({all symbols})*. Kleene invented the star in 1956 to describe regular languages mathematically. Today his notation is in every `re.compile()`, every `grep -E`, every `sed` and `awk` on the planet.

Operation	Notation	Example (L = {0, 11})	Result
Concatenation	L·L = L²	{0,11}·{0,11}	{00, 011, 110, 1111}
Power 0	L⁰	For any L	{ε}
Power 1	L¹	{0, 11}	{0, 11}
Kleene star	L*	∪ Lⁿ for n≥0	{ε, 0, 11, 00, 011, 110, 1111, ...}
Kleene plus	L⁺	∪ Lⁿ for n≥1	{0, 11, 00, 011, 110, 1111, ...}

**L* always contains ε** (the empty string), even when L does not. By definition: L⁰ = {ε} ⊆ L*. But **∅* = {ε}** - the Kleene star of the empty language is not the empty language, it is {ε}! This is a common exam trap.

A formal language is a programming language. Formal language theory is only useful for building compilers

A formal language is a mathematical abstraction: any set of strings L ⊆ Σ*. A programming language is merely a special case. Formal languages describe DNA sequences, communication protocols, musical patterns, and legal chess moves

A formal language is L ⊆ Σ*. Full stop. No word of 'computer', 'programming', or 'syntax'. Type 3 languages (regular) describe genome patterns in BLAST, IDS/IPS traffic filtering rules, and morphological patterns in NLP tokenizers. Type 2 languages (CFG) model the grammar of natural languages - imperfectly, but better than Type 3. Compilers are simply the most famous application of this mathematical tool, not its definition.

L = {ab, c}. Which strings belong to L²?

Key Takeaways

**Alphabet Σ** - a finite, non-empty set of symbols. Σ* - all strings over it, an infinite set. Two symbols {0,1} generate an infinite Σ*
**Formal language L ⊆ Σ*** - any subset of strings. A language is a set, not a rule. Different rules may describe the same language
**Kleene star L*** = zero or more repetitions from L. Always contains ε. Foundation of regex: `(a|b)*` = ({a}∪{b})*
**Chomsky hierarchy:** regex = Type 3 (DFA). HTML/balanced parens = Type 2 (CFG, stack). Turing-complete programs = Type 0. Why browsers don't parse HTML with regex - this hierarchy explains it

Вопросы для размышления

Is the set of all grammatically correct Russian sentences a formal language? If so, what is the alphabet?
∅* = {ε}. Explain intuitively why the Kleene star of the empty language is non-empty.
Return to the opening grep example: the regex ^[0-9]{3}-[0-9]{4}$ describes a formal language. Describe that language using union, concatenation, and Kleene star.

Связанные уроки

aut-01-fsm — Finite automata are the machines that recognize regular languages (Kleene's theorem)
comp-01-intro — Compilers use different levels of the Chomsky hierarchy at each phase
alg-01-big-o — Language recognition complexity: O(n) for DFA, O(n³) for CYK context-free parsing
dm-01 — Set theory and set operations are the mathematical language of formal languages
ml-01
ml-04

Formal Languages

What Is a Formal Language

**LLM tokenization (tiktoken, BPE):** GPT-4 splits text into tokens via a DFA - a finite automaton over the Unicode alphabet. A vocabulary of 100,277 tokens is a 100,277-symbol output alphabet
**Python compiler:** lexer (regular languages, Type 3) → parser (context-free, Type 2) → semantic analysis. Three levels of the Chomsky hierarchy inside one `python script.py`
**POSIX grep vs Perl regex:** POSIX guarantees O(n) matching - a regular language backed by a DFA. Perl backreferences step outside Type 3 and can be exponential - the root cause of ReDoS attacks on web servers
**Bioinformatics:** BLAST searches genomes via regular patterns over Σ = {A, T, G, C}. Alignment with gap penalties requires a context-free grammar

Alphabet: The Building Blocks of Words

Domain	Alphabet Σ	Size \|Σ\|
Binary data	{0, 1}	2
Genetics	{A, T, G, C}	4
Morse code	{dot, dash, pause}	3
ASCII	{chars 0–127}	128
Unicode	{chars 0–1,114,111}	1,114,112
Parenthetical expressions	{(, )}	2

Noam Chomsky and Formal Languages

Which of the following sets is a valid alphabet?

Strings: Words Over an Alphabet

Notation	Meaning	Example (Σ = {a, b})
Σ*	All strings (including ε)	{ε, a, b, aa, ab, ba, bb, aaa, ...}
Σ⁺	All non-empty strings	{a, b, aa, ab, ba, bb, aaa, ...}
\|w\|	Length of string w	\|aba\| = 3
wv	Concatenation of w and v	ab · ba = abba
wⁿ	w repeated n times	(ab)³ = ababab
wᴿ	Reversal of string w	(abc)ᴿ = cba

How many strings of length 3 does Σ* contain for alphabet Σ = {a, b, c}?

Language: A Set of Strings

Language	Description	Finite?	Example strings
{w ∈ {0,1}* : \|w\| = 3}	All binary strings of length 3	Yes (8 strings)	000, 001, 010, ...
{aⁿbⁿ : n ≥ 0}	Equal number of a's then b's	No	ε, ab, aabb, aaabbb
{w ∈ {a,b}* : w = wᴿ}	Palindromes	No	ε, a, b, aa, aba, ...
∅	Empty language	Yes (0 strings)	(no strings)
{ε}	Only the empty string	Yes (1 string)	ε
Σ*	All strings	No	Anything

Which statement is true?

Operations on Languages

Operation	Notation	Example (L = {0, 11})	Result
Concatenation	L·L = L²	{0,11}·{0,11}	{00, 011, 110, 1111}
Power 0	L⁰	For any L	{ε}
Power 1	L¹	{0, 11}	{0, 11}
Kleene star	L*	∪ Lⁿ for n≥0	{ε, 0, 11, 00, 011, 110, 1111, ...}
Kleene plus	L⁺	∪ Lⁿ for n≥1	{0, 11, 00, 011, 110, 1111, ...}

A formal language is a programming language. Formal language theory is only useful for building compilers

L = {ab, c}. Which strings belong to L²?

Key Takeaways

**Alphabet Σ** - a finite, non-empty set of symbols. Σ* - all strings over it, an infinite set. Two symbols {0,1} generate an infinite Σ*
**Formal language L ⊆ Σ*** - any subset of strings. A language is a set, not a rule. Different rules may describe the same language
**Kleene star L*** = zero or more repetitions from L. Always contains ε. Foundation of regex: `(a|b)*` = ({a}∪{b})*
**Chomsky hierarchy:** regex = Type 3 (DFA). HTML/balanced parens = Type 2 (CFG, stack). Turing-complete programs = Type 0. Why browsers don't parse HTML with regex - this hierarchy explains it

Вопросы для размышления

Is the set of all grammatically correct Russian sentences a formal language? If so, what is the alphabet?
∅* = {ε}. Explain intuitively why the Kleene star of the empty language is non-empty.
Return to the opening grep example: the regex ^[0-9]{3}-[0-9]{4}$ describes a formal language. Describe that language using union, concatenation, and Kleene star.

Связанные уроки

aut-01-fsm — Finite automata are the machines that recognize regular languages (Kleene's theorem)
comp-01-intro — Compilers use different levels of the Chomsky hierarchy at each phase
alg-01-big-o — Language recognition complexity: O(n) for DFA, O(n³) for CYK context-free parsing
dm-01 — Set theory and set operations are the mathematical language of formal languages
ml-01
ml-04

What Is a Formal Language

Alphabet: The Building Blocks of Words

Noam Chomsky and Formal Languages

Strings: Words Over an Alphabet

Language: A Set of Strings

Operations on Languages

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки

What Is a Formal Language

Alphabet: The Building Blocks of Words

Noam Chomsky and Formal Languages

Strings: Words Over an Alphabet

Language: A Set of Strings

Operations on Languages

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки