Formal Languages
Operations on Languages
Every regex is a combination of exactly three operations: `|` (union), symbol joining (concatenation), and `*` (Kleene star). Stephen Kleene proved in 1956 that these three operations are SUFFICIENT to describe any regular language. In 1968, Ken Thompson used this exact theorem to implement the regex engine inside the Unix editor `ed` - which still powers `grep` today. Three operations. One theorem. Over fifty years of tooling built on top.
- **Regex in code:** `(GET|POST|PUT)\s+/api/.*` - union of methods, concatenation with path, Kleene star for an arbitrary suffix
- **Compilers:** the lexer builds tokens via union (keyword | identifier | number) and concatenation (sign + digits = signed number)
- **Network filters:** firewall rules combine IP patterns via union and concatenation - these are operations on formal languages of packets
Предварительные знания
Kleene's star and the birth of formal language theory
In 1956 Stephen Kleene published 'Representation of events in nerve nets and finite automata', introducing the three operations - union, concatenation, and iteration (the Kleene star). He proved these three operations are sufficient to characterize every language recognized by a finite automaton. The paper was part of a broader effort to model McCulloch-Pitts neural nets mathematically. In 1968 Ken Thompson, co-creator of Unix, translated Kleene's theorem directly into code: his NFA construction algorithm for the editor `ed` turns a regex into a nondeterministic automaton by recursively applying the three operations as state-machine fragments. Thompson's algorithm remains the basis for RE2, Go's `regexp` package, and every backtrack-free regex engine today. Each time `grep` runs, it executes a 1956 theorem.
Language Union
Consider two filters in an inbox: one passes messages from colleagues, another passes messages marked "Urgent". To see messages matching **at least one** filter, both are **combined**. Formal language union works exactly the same way.
**Union** of languages L₁ and L₂ is the language L₁ ∪ L₂ = {w : w ∈ L₁ **or** w ∈ L₂}. A string belongs to the union if it belongs to at least one of the languages. This is the standard set-theoretic union operation.
**Union always "expands" a language:** |L₁ ∪ L₂| ≥ max(|L₁|, |L₂|). If the languages are disjoint (L₁ ∩ L₂ = ∅), then |L₁ ∪ L₂| = |L₁| + |L₂|. If one language is a subset of the other (L₁ ⊆ L₂), then L₁ ∪ L₂ = L₂.
| L₁ | L₂ | L₁ ∪ L₂ | Comment |
|---|---|---|---|
| {a, ab} | {b, ba} | {a, ab, b, ba} | Disjoint languages |
| {0, 1} | {1, 00} | {0, 1, 00} | Shared element: 1 |
| ∅ | L | L | ∅ - identity element |
| Σ* | L | Σ* | Σ* absorbs everything |
| {ε} | {a} | {ε, a} | Empty string + symbol |
In regular expressions, union is written as **|** (the pipe character). The regex `cat|dog` describes the language {cat, dog} = {cat} ∪ {dog}. The `|` symbol in regex is precisely formal language union.
L₁ = {a, bc}, L₂ = {bc, d}. What is L₁ ∪ L₂?
Language Concatenation
Union takes strings as-is. Concatenation is a fundamentally different operation: it **joins** strings pairwise. Take one string from the first language, one from the second, join them - the result is a string of the new language. Repeat for **all** combinations.
**Concatenation** of languages L₁ and L₂: **L₁·L₂ = {xy : x ∈ L₁, y ∈ L₂}**. For each pair of strings (one from L₁, one from L₂), a new string is created by joining them. Order matters: L₁·L₂ ≠ L₂·L₁ in general.
Note: |L₁·L₂| ≤ |L₁| × |L₂|. Equality is not always achieved - different pairs can produce the same result. For example, if L = {a, aa}, then L² = {aa, aaa, aaaa} - three strings, not four, because a·aa = aa·a = aaa.
| L₁ | L₂ | L₁·L₂ | |L₁·L₂| |
|---|---|---|---|
| {a, b} | {0, 1} | {a0, a1, b0, b1} | 4 |
| {ε} | L | L | |L| |
| ∅ | L | ∅ | 0 |
| {a, aa} | {a, aa} | {aa, aaa, aaaa} | 3 (not 4!) |
| {0, 1} | {0, 1} | {00, 01, 10, 11} | 4 |
**Two special cases of concatenation:** 1. L·{ε} = {ε}·L = L - the empty string acts as the identity element. 2. L·∅ = ∅·L = ∅ - the empty language "annihilates" concatenation. Don't confuse them: {ε} preserves the language, ∅ annihilates it.
L₁ = {0, 11}, L₂ = {ε, 0}. How many strings does L₁·L₂ contain?
Kleene Star
Think of a set of LEGO bricks (a finite language). Any number of bricks can be assembled in **any order**, **with repetition**. Or none at all - leaving an empty shelf. This is the **Kleene star** - the most powerful of the three basic operations.
**Kleene star:** L* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ ... = ∪_{n≥0} Lⁿ. This is the infinite union of all powers of the language. L⁰ = {ε} (always!), so **ε ∈ L* for any L**, including L = ∅. **Kleene plus:** L⁺ = L¹ ∪ L² ∪ L³ ∪ ... = L* \ {ε} (when ε ∉ L).
**Connection to Σ*.** If Σ = {a, b}, then Σ* is the set of ALL strings over this alphabet. But {a, b}* is also the set of all strings! So Σ* in the definition of a formal language is literally the Kleene star of the alphabet.
| Language L | L* | Description |
|---|---|---|
| {0, 1} | {ε, 0, 1, 00, 01, 10, 11, ...} | All binary strings = Σ* |
| {a} | {ε, a, aa, aaa, ...} | Strings of a's only: a* |
| {ab} | {ε, ab, abab, ababab, ...} | Repetitions of 'ab': (ab)* |
| ∅ | {ε} | Only element is ε |
| {ε} | {ε} | Repeating ε yields ε |
| {0, 11} | {ε, 0, 11, 00, 011, 110, 1111, ...} | Arbitrary combinations of 0 and 11 |
**Mnemonic:** L***** = "**zero** or more"; L**⁺** = "**one** or more". In regex: `a*` matches the empty string, `a+` does not. This follows directly from the formal definition.
L = {01}. Which strings belong to L*?
Closure Properties of Language Classes
We have studied three operations: union, concatenation, and Kleene star. A natural question arises: if two **regular** languages are unioned, does the result remain regular? What if the Kleene star is applied to a **context-free** language? The answer to these questions is called **closure**.
**Closure** of a language class under an operation means: applying the operation to languages from that class always yields a result in the same class. A class C is **closed** under operation ⊕ if for any L₁, L₂ ∈ C we have L₁ ⊕ L₂ ∈ C.
| Operation | Regular | Context-free | Context-sensitive |
|---|---|---|---|
| Union ∪ | ✓ Closed | ✓ Closed | ✓ Closed |
| Concatenation · | ✓ Closed | ✓ Closed | ✓ Closed |
| Kleene Star * | ✓ Closed | ✓ Closed | ✓ Closed |
| Intersection ∩ | ✓ Closed | ✗ NOT closed | ✓ Closed |
| Complement ¬ | ✓ Closed | ✗ NOT closed | ✓ Closed |
**Regular languages are the most "robust":** closed under all five operations. This makes them practical - regular expressions can be combined arbitrarily and the result is always regular. **Context-free languages** are weaker: they break down under intersection and complement.
Stephen Kleene and His Star
Stephen Kleene (1909-1994) introduced the Kleene star in his 1956 paper "Representation of events in nerve nets and finite automata". He was studying McCulloch-Pitts neural network models and formalized the notion of a "regular event" - a set describable by combinations of union, concatenation, and iteration. His notation has survived decades and is used by millions of programmers today - every time `.*` appears in a regex.
**Closure is a powerful proof tool.** Showing that language L is the intersection of two CFLs does NOT prove that L is context-free (CFLs are not closed under ∩). But if L is the union of two regular languages, it is definitely regular (regular languages are closed under ∪).
If two languages belong to the same class, the result of any operation on them will also be in that class
Closure depends on the SPECIFIC operation and the SPECIFIC class. Context-free languages are closed under union and concatenation, but NOT under intersection and complement
Each (class, operation) pair must be verified independently. The intersection of two CFLs can yield the language {aⁿbⁿcⁿ}, which no CFG can describe. Closure is a property that must be proved, not assumed.
Language L₁ is regular, L₂ is regular. What can be said about L₁ ∩ L₂?
Key Takeaways
- **Union L₁ ∪ L₂** - strings from the first OR second language; in regex this is `|`
- **Concatenation L₁·L₂** - joining all pairs of strings; {ε} is the identity element, ∅ is the annihilator
- **Kleene star L*** = ∪ Lⁿ - zero or more repetitions; **always contains ε**; L⁺ = L* without ε
- **Closure** - regular languages are closed under all operations; CFLs break down under intersection and complement
Related Topics
Language operations bridge abstract sets and practical tools:
- Regular Expressions — The three basic operations (∪, ·, *) are exactly the three regex operators. The next lesson introduces the formal syntax
- Finite Automata — NFA/DFA are constructed via union, concatenation, and star - Thompson's construction
- Chomsky Hierarchy — Closure properties define the boundaries of each level in the hierarchy
Вопросы для размышления
- Return to the regex from the introduction: `(GET|POST|PUT)\s+/api/.*`. Identify all three operations in it: union, concatenation, and Kleene star.
- Why does ∅* = {ε}? If the empty language contains no strings, where does ε come from? Hint: L⁰ = {ε} by definition.
- Context-free languages are closed under union, but not under intersection. Using De Morgan's law (A ∩ B = ¬(¬A ∪ ¬B)), explain why they are also not closed under complement.
Связанные уроки
- fl-01-intro — Formal alphabet, Σ*, and the definition of a language are prerequisites
- fl-05-regex — Regex syntax is a direct encoding of union (|), concatenation, and Kleene star (*)
- fl-06-dfa — Thompson's NFA construction applies each operation to build automaton fragments
- fl-02-chomsky — Closure properties under the three operations define the Chomsky hierarchy boundaries
- fl-07-nfa — NFA construction from regex directly implements union, concatenation, and star as state fragments
- ml-01
- alg-35-bit-manipulation