Compilers

Lexer Generators: Flex, ANTLR

A hand-written lexer of 1000 lines of code doing exactly what 20 lines in a .l file accomplish - that was the reality for most compilers of the 1960s. Lex arrived in 1975 and turned lexer writing from a week of work into half an hour. Today, Flex sits inside CPython, PostgreSQL, and GCC. ANTLR4 powers Hadoop and Twitter. The generators won.

**CPython** - the Python interpreter's lexer is partially built from lex-style tables
**PostgreSQL** - the SQL lexer is generated by Flex from `scan.l`, handling all keywords and literals
**ANTLR4 in Hadoop/Presto** - HiveQL and Presto SQL are parsed through ANTLR4 grammars
**GCC** - uses a hand-written lexer, but one designed on the same principles as Flex

Flex and lex: generating a lexer from rules

In 1975, Mike Lesk wrote **lex** - a tool that takes a table of regular expressions and produces a C-language lexer. The idea was simple: why write automata by hand when they can be built automatically from token descriptions? **Flex** (Fast Lexical Analyzer) is the GNU replacement for lex, and it still powers CPython, Ruby MRI, and PostgreSQL today.

A `.l` file has three sections separated by `%%`. The first section contains definitions and C header includes. The second contains rules of the form `pattern { action }`. The third contains auxiliary C code. Flex reads the file and produces `lex.yy.c` with an implementation of the `yylex()` function.

**Build:** `flex calc.l` produces `lex.yy.c`. Compile with `gcc lex.yy.c -lfl -o calc`. The `-lfl` flag links the Flex runtime library.

Input string `123abc`. Flex rules: `[0-9]+ { return INT; }` and `[a-z]+ { return ID; }`. What does the first call to yylex() return?

ANTLR4 Lexer: rules inside a grammar file

ANTLR4 (Another Tool for Language Recognition) is a generator that combines the lexer and parser in a single `.g4` file. Created by Terence Parr in 1989, it now powers Hadoop, Hibernate ORM, Twitter's query language, and Presto SQL. Unlike Flex, ANTLR4 generates code in Java, Python, C#, Go, Swift, and TypeScript.

In ANTLR4, lexer rules are written in **UPPERCASE**, parser rules in lowercase. This is not merely a convention - the generator uses capitalization to distinguish the two: the lexer builds tokens from characters, the parser builds a tree from tokens.

**ANTLR vs Flex:** Flex is a lexer-only tool; it pairs with yacc/bison for parsing. ANTLR is a full stack (lexer + parser), better for modern projects. Flex has minimal dependencies and is faster for simple cases.

In an ANTLR4 grammar, the rule `NUMBER : [0-9]+ ;` is written in uppercase. Why?

Token rules and priority

A classic trap: the keyword `if` and the identifier `ifvar` both match the pattern `[a-zA-Z]+`. How does the lexer decide which one it has? Most generators apply two rules: **longest match** (the munch rule) and **order priority** when lengths are equal.

In Flex, rules in the `%%` section are tried top-to-bottom on equal-length matches. In ANTLR4, lexer rules appearing earlier in the file have higher priority. Both systems intentionally make order significant - it gives the grammar author explicit control over conflicts.

**Debugging priorities:** `flex -d` enables debug output - each match is printed to stderr. `antlr4 -Xlog` generates a detailed DFA construction log on the first run of the lexer.

Two Flex rules: `"return" { return KW_RETURN; }` and `[a-z]+ { return ID; }`. Input is `"returns"`. What does the lexer return?

Lexer Modes: context-sensitive tokenization

Some languages give the same character different meanings depending on context. Inside a string `"hello"`, the character `{` is just a letter. Inside a template literal `` `Hello, ${name}!` ``, the `{` starts an interpolated expression. A single automaton cannot handle this without additional state.

**Lexer modes** are named states of the lexer, each with its own set of active rules. ANTLR4 supports them natively: the directive `pushMode(X)` switches to mode X, and `popMode` returns to the previous mode via a stack. Flex implements the same idea through `BEGIN(STATE)` / `INITIAL`.

**Flex equivalent:** `%x TEMPLATE INTERP` declares exclusive states. `BEGIN(TEMPLATE)` switches state. Key difference from ANTLR4: Flex uses a simple variable (not a stack), so nested states require manual stack management.

Lexer modes are the same as parser contexts and can be replaced by more complex regular expressions.

Modes are finite automata with multiple initial states. Regular languages are closed under concatenation but cannot 'remember' which section of the input they are processing.

Regular expressions describe regular languages (memoryless automata). Context such as 'inside a string' requires at least one bit of memory. Modes add that memory explicitly and in a controlled way.

Why are lexer modes needed if regular expressions are sufficient to describe individual tokens?