Compilers

Lexical Analysis: Basics

1975. Bell Labs. Mike Lesk takes a weekend and writes `lex` - a tool that generates lexers from regular expressions. Before him, every compiler had a handwritten lexer: hundreds of lines of switch-case, bugs in string literal handling, one person dedicated just to the scanner. After Lesk's weekend - one page of regex rules, everything else generated automatically. GCC, Python, Ruby - the principle is the same today.

  • PostgreSQL scan.l: a flex lexer tokenizes SQL into keywords, identifiers, and dollar-quoted strings - on every single query
  • Clang handwritten lexer: a hand-coded DFA with SIMD optimizations processes roughly 500 MB/s of C++ source - faster than any flex output
  • Tree-sitter in GitHub Copilot: an incremental lexer+parser re-tokenizes only the changed lines on every keystroke
  • Python tokenize: the CPython lexer generates INDENT/DEDENT tokens from indentation - the foundation of whitespace significance in Python

Tokens and Lexemes

1975. Bell Labs. Mike Lesk takes a weekend and writes `lex` - a tool that generates lexers from regular expressions. Before that, every compiler had a handwritten lexer: hundreds of switch-case lines, a dedicated bug tracker just for the scanner. After Lesk's weekend - one table of regex rules, the rest is generated. GCC, Python, Ruby, Java - all of them use this principle to this day. But to understand why lex worked, one needs to understand what exactly it does.

A lexer is the first phase of a compiler. Its job: transform a stream of characters (raw program text) into a stream of **tokens**. A token is a structural unit: type + value. The type says what it is (keyword, number, identifier), the value holds the actual text.

A **lexeme** is the raw text that matched a token. For the token `[NUMBER "42"]` the lexeme is the string `"42"`. The terminology is confusing: different sources swap token and lexeme. What matters: there is a category (type) and concrete text (lexeme).

The lexer typically discards whitespace, tabs, and comments - they carry no structural meaning for the parser. Exception: Python, where indentation is semantically significant. The CPython lexer generates special INDENT and DEDENT tokens whenever the indentation level changes.

In real systems, one token type can capture a wide class of lexemes. Clang distinguishes more than 400 token kinds. The Rust lexer encodes `>>` both as a single SHR token and as two GT GT tokens - because in Rust `Vec<Vec<i32>>` must close two nested generic parameters, not be a right-shift. Semantics forces different decisions at the lexer level.

In modern language servers (LSP), the lexer runs incrementally - only over the changed text range. Tree-sitter achieves this through incremental parsing: O(log n) relative to the change size, not O(n) relative to the file size.

Source text: `while (x > 0)`. How many tokens does the lexer produce (whitespace discarded)?

DFA Lexer and Longest Match Rule

How does the lexer decide where one token ends and the next begins? The naive approach - try all rules and take the first match - breaks immediately. `>=` can be read as `>` and `=` separately, or as a single `>=` operator. The correct answer is always the second. This is the **longest match rule (maximal munch)**.

Under the hood, a lexer is a **Deterministic Finite Automaton (DFA)**. Each state is a position in recognizing a token. Transitions happen on characters. Accepting states mark where a token is complete. A DFA runs in O(n) of input length: each character is processed exactly once.

When a lexer has multiple rules, conflicts are possible: the string `return` matches both the identifier pattern `[a-zA-Z]+` and the keyword `return`. The standard resolution: **rule priority**. Keywords are listed before identifiers. In lex/flex, the rule written first wins when two rules produce a match of equal length.

GCC and Clang do not use generated lexers - they are handwritten for performance. Clang's hand-coded lexer processes roughly 500 MB/s of source code on a modern CPU. A hand-rolled implementation allows SIMD instructions to scan whitespace and comments a full vector of characters at a time.

An important edge case: what to do with a character that matches no rule? The lexer must generate a **lexical error** with the position (line:column). Good lexers continue analysis after the error - recovery mode. Clang on a lexical error tries to skip the character and continue, to surface as many errors as possible in a single pass.

A lexer sees `===`. What does it recognize by the longest match rule, given tokens `=` (ASSIGN) and `==` (EQ)?

Lex, Flex, and Lexer Generators

Lesk's insight in 1975 was simple: every lexer does the same thing - convert regular expressions into a DFA. So one could write a tool that generates a lexer from a table of regex rules. That was `lex`. Its Open Source successor - `flex` (Fast Lexical Analyzer) - is still used today in GNU coreutils, PostgreSQL, and Bash.

Flex compiles this into C code with a DFA table. The table encodes all automaton states and transitions. Given a character stream as input, it produces a token stream through calls to `yylex()`. All logic for longest match, backtracking, and priorities is already embedded in the generated code.

Modern alternatives to flex have found new life in ML tooling. ANTLR4 generates lexers for Java, Python, JavaScript, and C# from a single grammar. Tree-sitter (used in GitHub Copilot and Neovim) generates incremental parsers with a built-in lexer. The logic is the same as Lesk's - regular expressions compiled to automata - but now with support for incremental re-parsing when a file changes in an editor.

PostgreSQL uses flex for its SQL lexer. The file `scan.l` contains rules for SQL keywords, string literals (including dollar-quoting via $$...$$), numeric literals, and operators. Every SQL query starts processing through this lexer - millions of times per second on loaded databases.

A flex file has two rules: `[a-zA-Z]+` returns IDENT, and `"if"` returns KEYWORD_IF. What does the lexer return for the string `if`?

Key Ideas

  • **Token** = type + lexeme. A lexer converts a character stream into a token stream in O(n) via a DFA
  • **Longest match rule**: the lexer is greedy - it takes the longest possible lexeme. `>=` is one token, not two
  • **DFA underneath**: each regex rule compiles to an NFA, combined, then converted to a DFA. Flex does this automatically
  • **Rule priority**: when two rules produce equal-length matches, the one listed first wins - that is why keywords come before identifiers
  • **Lex/Flex**: regex rules -> generated C lexer. PostgreSQL, GNU coreutils, and Bash still use flex today

Related Topics

Lexical analysis is the foundation for all subsequent language processing:

  • Regular Expressions in the Lexer — Mathematical basis for describing tokens through regex and NFA/DFA
  • Syntax Analysis (Parser) — The parser receives the token stream from the lexer and builds an AST
  • Compiler Architecture — The lexer is the first of three core phases: lexer, parser, semantics

Вопросы для размышления

  • Why did Clang write a lexer by hand instead of using flex, if flex automates the whole process?
  • Python uses significant indentation. How does the lexer distinguish an indentation increase from a line continuation?
  • A language has a `->` operator and separate `-` and `>`. How does the longest match rule handle `->`?

Связанные уроки

  • comp-05-history — Context of compiler phases
  • comp-07-regex-in-lexer — Regex formalizes lexer pattern matching
  • comp-08-handwritten-lexer — Hands-on implementation of lexer concepts
  • fl-05-regex — Formal regex theory underpins lexer patterns
  • comp-03-compiler-architecture — Lexer is front-end stage of compiler pipeline
  • plt-24-lexer
Lexical Analysis: Basics

0

1

Sign In