Data Structures
Trie: A Tree for Strings
Предварительные знания
- Introduction to trees
From de la Briandais to the name 'trie'
Rene de la Briandais described the prefix-tree structure in 1959 as a way to store file directories without rescanning entire keys. In 1960 Edward Fredkin coined the name 'trie', taken from the middle of the word reTRIEval, which is why some people still pronounce it 'tree' and others say 'try'. The structure later became core infrastructure for IP routing tables, where longest-prefix match decides where every packet goes, and for autocomplete systems that resolve a prefix into ranked suggestions in milliseconds.
GPT-4's tokenizer holds a vocabulary of 50,257 tokens. Every single tokenization - zero linear scan - runs on a trie. The same structure handles Google autocomplete, VS Code IntelliSense, and every IP router on the internet. One idea, everywhere string prefixes matter.
- Google autocomplete - billions of queries per second, sub-millisecond prefix lookup
- LLM tokenizers (tiktoken, sentencepiece) - BPE vocabulary indexed as a trie
- VS Code IntelliSense - symbol prefix matching across entire codebase
- IP routing - Patricia trie for longest-prefix match, nanosecond packet forwarding
- Aho-Corasick antivirus - simultaneous search for thousands of virus signatures
The Autocomplete Problem
GPT-4's tokenizer holds a vocabulary of 50,257 tokens. Every time the model generates text, that vocabulary is indexed by a prefix structure - a trie variant. The same structure that powers Google's search suggestions, VS Code's IntelliSense, and the autocorrect on every smartphone.
The naive approach stores words in a hash set and scans linearly for prefix matches. With one million words that scan is O(n * m) per query, where m is the prefix length. At Google's scale - billions of queries per second - that's physically impossible.
A Trie (from *retrieval*) eliminates the scan entirely. Instead of storing complete words, it encodes each character as an edge. The path from root to any node spells out a prefix. All words sharing that prefix live in the subtree below it.
BPE tokenization in modern LLMs - the process that converts "programming" into ["program", "ming"] - builds a trie over the merge vocabulary at initialization. Every tokenization call traverses it.
A Trie accelerates prefix lookups because:
Node Design and the Array-vs-Dict Trade-off
Two implementations exist for `children`. The dict variant uses O(k) space where k is the number of distinct children - sparse, flexible, handles Unicode. The array variant pre-allocates 26 slots for lowercase ASCII - wastes memory at shallow nodes, but each character lookup is a direct index dereference instead of a hash computation.
Production autocomplete engines at Elasticsearch and Lucene use a compressed variant called DAWG (Directed Acyclic Word Graph), which merges common suffixes too - not just common prefixes. A full English dictionary compresses from ~10 MB to ~2 MB.
The `is_end` flag is the critical detail. Without it, the word 'car' and the prefix 'car' (inside 'card') are indistinguishable. Every node on the path to 'card' exists, but only the node for 'car' has `is_end=True`.
Why does TrieNode need an `is_end` flag instead of just tracking which nodes exist?
Inserting a Word: O(m) Independent of Vocabulary Size
Insert walks m characters. When a node is missing it allocates one; otherwise it steps forward into the existing child. The vocabulary can grow to 10 million words and the insert cost stays O(m). That is the decisive advantage over sorted arrays at O(m log n) and over a linear scan at O(n * m).
Inserting a word of length m into a Trie with n existing words costs:
Search and starts_with
Both operations share `_find_node`, which walks the prefix path in O(m). The only difference is the final check: `search` demands `is_end=True` at the landing node, while `starts_with` just needs the node to exist. The word 'car' returns true for `search` only if 'car' was inserted as a complete word, not merely as the prefix of 'card'.
This distinction drives LLM tokenization. The tokenizer relies on `starts_with` to greedily match the longest token: at each position in the input it follows the trie as far as it can, then backtracks one step when no child matches. Autocomplete builds directly on `starts_with`: find the prefix node, then walk its subtree to collect every completion.
The difference between search and starts_with is:
Applications: Autocomplete, Routing, Multi-Pattern Search
- **Autocomplete** - Google search, IDE symbol completion, terminal history
- **Spell checking** - fast existence test for dictionary words
- **IP routing** - longest-prefix match in routers and switches
- **T9 input** - the old keypad text-entry on feature phones
- **Bioinformatics** - searching DNA and protein sequences
Autocomplete is the canonical case. Find the prefix node, then run a bounded DFS over its subtree. A `limit` parameter stops the walk after the first few hits, which is what a real-time UI needs when it shows only the top 10 completions.
IP routing hardware uses a 256-ary trie (one level per byte of the address) known as a Patricia trie. Every packet switch on the internet matches destination IPs against routing tables this way, in nanoseconds. The Aho-Corasick algorithm, used by antivirus engines, intrusion detection, and the `grep -F` flag, extends a trie with failure links to search for thousands of patterns in a single pass over the text.
| Operation | Trie | HashSet | BST |
|---|---|---|---|
| Word lookup | O(m) | O(m) avg | O(m log n) |
| Prefix lookup | O(m) | O(n m) | O(n m) |
| Autocomplete | O(m + k) | O(n m) | O(n m) |
A Trie wins when you need prefix operations. For exact-match lookups a HashSet is usually enough.
A Trie uses more memory than a HashSet, so it should be avoided when memory is tight.
Memory depends on structure. A compressed trie (DAWG) for a 100k-word dictionary can fit in 2 MB. A HashSet storing the same words with full string copies may use more. The trade-off is: Trie pays per unique character in the dataset; HashSet pays per character per word.
Shared prefixes in a Trie collapse storage. 1000 words starting with 'pre-' share the same three nodes for p-r-e.
For which task is a HashSet strictly better than a Trie?
Next Topics
Tries encode prefix structure. Heaps encode priority order. The next lesson introduces a tree where the root is always the minimum - enabling O(1) minimum access and powering Dijkstra, beam search, and every scheduler.
- Heap — Priority order tree - O(1) min access
- Pattern matching — Aho-Corasick extends Trie for multi-pattern search
Key Ideas
- Trie = prefix tree: edges are characters, paths are prefixes, is_end marks complete words
- All operations O(m), independent of vocabulary size n
- Prefix enumeration O(m+k) - impossible in O(m) with a HashSet
- Production variants: DAWG (merged suffixes), Patricia trie (compressed paths), radix tree
- Core of LLM tokenizers, autocomplete engines, IP routing, multi-pattern search
Вопросы для размышления
- A Trie behaves like a finite automaton: each node is a state, each edge a character transition. What other string algorithms exploit this automaton view?
- Spell-checkers use tries with edit-distance thresholds (fuzzy matching). How would the DFS change to allow up to k character mismatches?
- A ranked autocomplete needs the most frequent completions first, not alphabetical. What data would need to be added to each Trie node?
Связанные уроки
- alg-23-pattern-matching — Trie enables sub-linear multi-pattern search (Aho-Corasick)
- alg-24-kmp — Both exploit prefix structure to avoid redundant comparisons
- alg-01-big-o — O(m) vs O(n*m) distinction only makes sense with complexity literacy
- ds-06-hash-intro — HashSet vs Trie: same insert/search cost, different prefix power
- ds-14-heaps-intro — Trie+Heap combination powers ranked autocomplete in production
- ds-12-balanced-trees
- ds-17-graph-algorithms
- aie-12-rag-fundamentals
- comp-06-lexer-basics