Formal Languages

The Chomsky Hierarchy

Noam Chomsky and Three Models of Language

In 1956, linguist Noam Chomsky published 'Three models for the description of language'. The goal was to formalize the grammar of natural languages. But something larger emerged: Chomsky described the expressiveness of computation. His four grammar types correspond precisely to four classes of automata - DFA, PDA, LBA, TM. The paper became the foundation of formal language theory and computability theory.

Цели урока

  • Distinguish the four language types in the Chomsky hierarchy and their recognizers
  • Identify the type of a given task and choose the right tool (regex/PDA/LBA/TM)
  • Explain why SQL injection cannot be blocked with regex alone
  • Explain why the TypeScript type checker always terminates while Agda's does not

Chomsky's 1956 hierarchy was not about human language. It predicted that TypeScript's type checker cannot verify all correct programs (Rice's theorem - Type 0). That JSON cannot be validated with regex (Type 2 vs Type 3). That SQL injection cannot be blocked with patterns alone. One paper by a linguist - and 70 years of architectural decisions follow from it.

  • **Compilers** - lexer (regex, Type 3) -> parser (CFG, Type 2) -> semantic analysis (Type 1+): three layers of the hierarchy in one tool
  • **SQL injection** - regex-based defenses (Type 3) cannot verify the CF structure of a SQL query. Parameterized queries operate at the right level
  • **TypeScript** - the type system is Turing-complete (Type 0), but Microsoft caps recursion depth so the checker always terminates
  • **Bioinformatics** - regex for motif search in DNA, CFGs for RNA secondary structure (stack = hairpin loops)
  • **RE2/Rust regex** - guaranteed O(n) via DFA without backtracking: a deliberate restriction to Type 3 for predictable performance

Предварительные знания

  • What Is a Formal Language

Type 0: Recursively Enumerable Languages

Starting with the most powerful class - **Type 0**. These are the languages that a **Turing machine** - the theoretical model of any computer - can recognize. A Type 0 grammar places no restrictions on its rules: any substring may be replaced by any other substring. Dependent types in Agda/Coq, full HM-style type inference - these are Type 0 computations.

**Recursively enumerable languages (r.e.)** - languages for which a Turing machine (TM) exists that accepts all strings in L. A Type 0 (unrestricted) grammar has rules of the form α -> β, where α contains at least one non-terminal. **The catch:** TM may loop forever on strings not in L - there is no halt guarantee.

**Recursive languages** (a subclass of r.e.) - those for which the TM always halts: it either accepts or rejects. For recursively enumerable (but non-recursive) languages, the TM may run forever on a string not in L. The language of 'programs that halt' is r.e. but not recursive - the halting problem.

ClassTM halts?DecidabilityExample
RecursiveAlways (for w in L and w not in L)Decidable{a^n b^n c^n}
R.e., not recursiveFor w in L - yes; for w not in L - may loopSemi-decidableThe halting problem
Not r.e.No TM recognizes LUndecidableComplement of the halting problem

A Turing machine for a recursively enumerable language L receives a string w not in L. What happens?

Type 1: Context-Sensitive Languages

**Type 1** adds one restriction: the right-hand side of every rule is no shorter than the left-hand side. Derivations never shorten the string. This models **context-sensitive** substitutions: a non-terminal is replaced differently depending on its neighbors. HTML in its full specification is context-sensitive: the rule 'the `<li>` tag is valid only inside `<ul>` or `<ol>`' depends on context.

**Context-sensitive languages (CSL)** - Type 1 in the Chomsky hierarchy. Grammar: rules of the form αAβ -> αγβ, where A is a non-terminal, α and β are context, and γ is a non-empty replacement. Equivalently: |right-hand side| >= |left-hand side|. Recognizer: **Linear Bounded Automaton (LBA)** - a Turing machine whose tape is bounded by the input length.

**Why 'context-sensitive'?** In the rule αAβ -> αγβ, non-terminal A is replaced by γ only when α is to its left and β is to its right. In SQL: the expression `ORDER BY` has different semantics depending on whether it appears inside `SELECT` or `CREATE VIEW` - that is context-sensitive behavior.

PropertyType 0Type 1 (CSL)
Rulesα -> β (unrestricted)|α| <= |β| (non-contracting)
RecognizerTuring machineLBA (TM with bounded tape)
DecidabilitySemi-decidableAlways decidable
ExampleThe halting problem{a^n b^n c^n}, {ww}

Language L = {ww : w in {a,b}*} (strings made of two identical halves). What type does it belong to?

Type 2: Context-Free Languages

**Type 2** is the workhorse of Computer Science. Rules take the form A -> γ: a single non-terminal on the left, anything on the right. Context is irrelevant - hence 'context-**free**'. CFGs describe the syntax of programming languages: nested brackets, if-then-else, arithmetic expressions. JSON, XML, and most programming languages are Type 2. Parsers (LL, LR, PEG) in compilers are Type 2 recognizers.

**Context-free languages (CFL)** - Type 2. Grammar (CFG): rules A -> γ, where A is a single non-terminal. Recognizer: **pushdown automaton (PDA)** - a finite automaton with a stack. The stack enables tracking of nesting depth.

LanguageCF?Why
{a^n b^n : n >= 0}YesStack counts a's and matches b's
{bracket expressions}YesNesting = stack
Python/Java/JSON syntaxYes (approximately)Recursive descent / LR parser
{a^n b^n c^n}NOStack cannot count three groups simultaneously
{ww}NOStack is LIFO; direct access is required
{ww^R} (palindromes)YesPush w, pop for w^R - the stack is perfect!

**Why does SQL injection bypass regex-based defenses?** Regex (Type 3) cannot verify the context-free structure of a SQL query. Parameterized queries work at the correct level - they operate on the syntax (Type 2), not surface patterns.

The language of valid bracket expressions {(), (()), ()(), (()()), ...} is:

Type 3: Regular Languages

**Type 3** is the simplest yet most practical class. Regular languages are described by **regular expressions** and recognized by **finite automata** (DFA/NFA). No stack, no tape - just a finite set of states. grep, sed, awk, form validation, the lexer in a compiler - all Type 3. RE2 (Google) and Rust's `regex` crate guarantee O(n) time with no exponential backtracking - a deliberate restriction to Type 3 for predictability.

**Regular languages** - Type 3. Grammar: rules A -> aB or A -> a (right-linear). Recognizer: **DFA/NFA**. Equivalent description: **regular expressions** (union, concatenation, Kleene star). All three representations are equivalent - Kleene's theorem.

DescriptionRegexDFA states
Strings starting with 0^0.*2
Even number of symbols(..)*2
Contains substring 'ab'.*ab.*3
Email (simplified)[a-z]+@[a-z]+\.[a-z]+~7
IP address(\d{1,3}\.){3}\d{1,3}~15

**Limitations of regular languages:** a finite automaton has no memory beyond its current state. It cannot count, compare substrings, or verify nesting. {a^n b^n}, {ww}, balanced brackets - none of these are regular. That is precisely why 'verify that HTML tags are correctly nested' cannot be solved with a regex.

Which of these languages is NOT regular?

Hierarchy: Nesting and Boundaries

The four types form a strict **nested hierarchy**: Regular < CF < CS < R.E. At every boundary there exist languages that belong to the broader class but not the narrower one. This hierarchy is a map of computational power. Knowing the level of a task immediately tells what tool is needed.

**The Chomsky Hierarchy:** Type 3 (Regular) is a proper subset of Type 2 (Context-Free) is a proper subset of Type 1 (Context-Sensitive) is a proper subset of Type 0 (Recursively Enumerable). Strictness: {a^n b^n} is CF but not regular. {a^n b^n c^n} is CS but not CF. The halting problem is Type 0 but not Type 1.

TypeGrammarAutomatonPractical use
3 - RegularRight-linearDFA / NFAgrep, awk, compiler lexer
2 - CFCFGPDA (stack)Parser, JSON/XML, most PLs
1 - CSCSGLBAFull HTML spec, some type checks
0 - R.E.UnrestrictedTMDependent types, full type inference

**Practical significance:** email validation - O(n) in a single pass (DFA). JSON parsing - O(n) with a stack (PDA). TypeScript type checking - always terminates (Microsoft caps recursion depth, even though the type system is Turing-complete). Knowing the hierarchy immediately answers 'what tool should be applied here'.

**Beyond the hierarchy** lie languages that are not even recursively enumerable - no Turing machine can recognize them. Example: the set of all TMs that do NOT halt (complement of the halting problem). In a set-theoretic sense, there are 'more' such languages than recognizable ones.

HTML is a context-free language because it has nested tags

HTML in its full specification is context-sensitive. Rules such as 'the li tag is valid only inside ul or ol' depend on context. Browsers use specialized (non-CF) parsers that handle invalid HTML through error recovery.

A CFG cannot express the rule 'tag X must be closed by the same tag X'. Simple nesting is CF, but matching tag names (div...div) is CS. The HTML5 specification describes the parser as a state machine with a tree constructor - which goes beyond a PDA.

JSON has nested objects {}, arrays [], and recursive structures. What type does the language of JSON belong to?

Key Ideas

  • **Type 3 (regular):** DFA/NFA, regex - no memory, finite states only. grep, lexer, form validation
  • **Type 2 (CF):** PDA, CFG - a stack enables processing of nested structure. Parsers, JSON, most PLs
  • **Type 1 (CS):** LBA - bounded-length tape, context-dependent substitutions. Full HTML, some type checks
  • **Type 0 (r.e.):** TM - maximum power, may loop. Dependent types, full type inference
  • Strict containment: Reg is a proper subset of CF is a proper subset of CS is a proper subset of RE
  • SQL injection works because regex (Type 3) cannot verify the structure of SQL (Type 2)

Related Topics

The Chomsky Hierarchy is a map of the entire theory of formal languages:

  • What is a Formal Language — Alphabets, strings, and operations - the basic concepts underlying all types
  • Grammars: Generating Languages — Formal rules for generating strings at each level of the hierarchy
  • DFA and NFA — Recognizers for regular languages (Type 3) - in depth

Вопросы для размышления

  • What type is SQL? Can a finite automaton parse SELECT with nested subqueries?
  • Can a regex verify that a string is a valid arithmetic expression with nested brackets? If not, what is the minimum tool required?
  • PCRE (Perl Compatible Regular Expressions) supports recursion. Does this violate the Chomsky Hierarchy? Why is PCRE technically not 'regular' despite the name?

Связанные уроки

  • fl-01-intro — Alphabets and strings - foundations for all hierarchy types
  • fl-03-grammars — Formal production rules for each hierarchy level
  • fl-06-dfa — DFA - the Type 3 recognizer, in depth
  • comp-02-language-processing — Lexer (Type 3) and parser (Type 2) - hierarchy inside a compiler
  • fl-23-halting-problem — The halting problem lives at the boundary of Type 0
  • plt-02-type-systems — Type system expressiveness correlates with hierarchy levels
  • ml-04
  • comp-01-intro
The Chomsky Hierarchy

0

1

Sign In