Information Retrieval

Inverted Index

Google processes 8.5 billion queries a day. PageRank is no longer the main algorithm. BERT, RankBrain, MUM, Multitask Unified Model - every query passes through 200+ signals. It all started with the inverted index of the 1960s. When a Google search takes 200 milliseconds, 190 of those are spent fetching from the index. Without it, none of the modern algorithms could run in real time.

**Lucene/Elasticsearch** - open-source engine powering search at Wikipedia, GitHub, Netflix, eBay; the index updates in near-realtime through segment merging
**Google** stores an index hundreds of petabytes in size, distributed across thousands of servers; each query traverses a multi-level index hierarchy
**VS Code** (Find in Files) uses a simplified inverted index for instant project-wide search - milliseconds instead of seconds of linear scan
**Cloudflare** uses Lucene-based indexing to search across 100+ trillion DNS records and log events in real time
**Typesense and Meilisearch** - modern open-source search engines with inverted index + BM25 + typo-tolerance as hosted services

Предварительные знания

Introduction to Information Retrieval

Gerald Salton and the Birth of Modern Search

In the 1960s, Gerald Salton at Cornell University created the **SMART system** (System for the Mechanical Analysis and Retrieval of Text) - the first prototype of modern search. Salton formalized the concept of an inverted index, introduced **TF-IDF** (term frequency - inverse document frequency) and the vector space model. Without this work, neither Lucene nor Google would exist in their current form. Salton's 1975 paper 'A Vector Space Model for Automatic Indexing' is the foundation on which all modern search systems rest.

Posting List

Google processes 8.5 billion queries a day. PageRank is no longer the main algorithm. BERT, RankBrain, MUM, Multitask Unified Model - every query passes through 200+ signals. It all started with the inverted index of the 1960s. Without it, none of these algorithms could run in 200 milliseconds.

An inverted index consists of two parts: a **dictionary** - the list of all unique words - and **posting lists** - for each word, the list of documents in which it appears. The posting list is the heart of a search engine.

To answer a multi-word query (e.g., "binary algorithm") the system needs to find the **intersection** of their posting lists. The query "binary AND algorithm" means: find documents that are in both lists.

Posting lists are **always sorted** by doc_id. This allows intersection in O(n + m) instead of O(n * m). For a query with k words, it is optimal to start the intersection with the shortest list.

Besides intersection (AND), the following operations are supported: **union** (OR) - merge of two lists, and **subtraction** (NOT) - set difference. Together they allow boolean queries: `(python OR java) AND NOT javascript`.

Why are posting lists stored in sorted order by doc_id?

Tokenization

Before building an index, text needs to become a list of **tokens**. Tokenization is the process of splitting text into indexing units. Sounds simple - just split on spaces? In practice it is far more complex. Gerald Salton and the SMART system in the 1960s first formalized this process - and laid the foundation for TF-IDF, on which modern search rests.

Problem	Example	Solution
Case	Python vs python	Lowercasing
Punctuation	end. vs end	Strip punctuation
Stop words	the, is, at, on	Remove (but careful - "to be or not to be")
Stemming	running, runs, ran	Reduce to stem: run
Lemmatization	better -> good	Reduce to lemma (dictionary form)
Compound words	New York, ice cream	N-grams or NER
Numbers/dates	01/04/2026	Normalize format

**Stemming vs Lemmatization**: stemming (Porter Stemmer) chops off suffixes by rules (fast but rough: "university" -> "univers"). Lemmatization uses a dictionary and morphology (more accurate but slower: "better" -> "good"). Most search engines use stemming + a synonym dictionary.

Tokenization is especially challenging for **CJK languages** (Chinese, Japanese, Korean) - there are no spaces between words. For morphologically rich languages a dedicated stemmer is needed to account for cases, declensions, and conjugations.

The result of tokenization critically affects search quality. Too aggressive stemming will merge different words (e.g., "arm" and "army"). Too lenient will miss different forms of the same word.

A user searches for "running shoes". Which processing step will help find a document containing "run in new shoes"?

Compression

A posting list for a frequent word like "the" may contain billions of doc_ids. Google stores an index hundreds of petabytes in size - without compression that would be impossible. **Delta encoding + Variable Byte** reduces index size by 4-10x. This is not an engineering detail - it is a condition for the system to be viable.

The key idea is **delta encoding** (gap encoding). Instead of absolute doc_ids, store the differences (gaps) between consecutive values. Since the lists are sorted, gaps are always positive and often small.

After delta encoding, small gaps are encoded with **Variable Byte Encoding** (VByte): 7 bits per byte for data and 1 bit as a continuation flag. The number 5 takes 1 byte instead of 4. Elasticsearch and Lucene use VByte. Google uses more efficient encodings: **PForDelta** and **SIMD-BP128**.

Compression method	Decode speed	Compression ratio	Usage
Variable Byte	Very fast	~50%	Lucene, Elasticsearch
Gamma Coding	Medium	~40%	Theoretically optimal for small gaps
PForDelta	Fast (SIMD)	~35%	Google, modern engines
Simple-9	Fast	~45%	Compact posting lists

Dictionary compression is a separate task. **Front coding** is used: common prefixes are stored once (auto -> autom -> automat -> automatic -> automobile). This saves up to 50% of dictionary space.

Posting list [100, 103, 105, 200, 201]. What does it look like after delta encoding?

Skip Lists

Intersecting two posting lists of length n and m takes O(n + m). Can it be faster? **Skip lists** allow "jumping over" blocks of elements that are definitely not needed, speeding up intersection to O(sqrt(n)). This is the same idea as binary search, but built into the data structure.

The optimal gap between skip pointers is **sqrt(L)**, where L is the length of the posting list. For a list of 10,000 elements, place a skip pointer every 100 elements. Too frequent - wastes memory. Too sparse - no speedup.

In practice, pure skip lists are found in educational implementations. Industrial systems (Lucene, Elasticsearch) use **block-based indices**: the posting list is split into blocks of 128 doc_ids, each block compressed separately, and skip pointers point to block starts. This combines compression and fast skipping.

Another optimization - **query optimization**: for an AND query with multiple words, start the intersection with the shortest posting list. For the query "the AND algorithm AND quantum" first intersect "quantum" (rare) with "algorithm" (medium), and deal with "the" (very frequent) last or skip it altogether.

Optimization	Idea	Speedup
Skip lists	Jump over irrelevant elements	O(n+m) -> O(sqrt(n))
Query reordering	Start with shortest list	Reduces intermediate results
Block-based index	Blocks of 128 + skip between them	SIMD parallelism + compression
Early termination	Stop when top-k is stable	Saves work on long tails

An inverted index is just a hash table (word -> list of documents)

Posting lists are sorted, compressed with delta encoding + VByte, equipped with skip pointers, and store not only doc_id but also positions, frequencies, and other metadata

A hash table does not support efficient list intersection, compression via delta encoding, or phrase search. A real inverted index is a complex multi-layer structure. Lucene uses block-based posting lists, an FST dictionary, and BM25 normalization. That is decades of engineering built on top of Salton's idea from the 1960s.

What is the optimal skip pointer stride for a posting list of length 10,000?

Key Ideas

**Posting list** - sorted list of doc_ids for each term; intersection in O(n + m) via two-pointer algorithm
**Tokenization** - converting text to terms: lowercase, stemming, stop word removal. Gerald Salton and the SMART system in the 1960s - the foundation of modern approaches
**Delta encoding + VByte** compresses posting lists by 4-10x by encoding differences between doc_ids. Google uses PForDelta and SIMD-BP128
**Skip lists** speed up intersection to O(sqrt(n)), allowing jumps over irrelevant blocks. Optimal stride: sqrt(L)
**Block-based indices** in Lucene/Elasticsearch: 128 doc_ids per block, compression + skip between blocks - SIMD parallelism in production

Вопросы для размышления

Why are stop words (the, is, a) typically excluded from the index? In what cases could this backfire?
How would phrase search be implemented ('machine learning' as a phrase, not two separate words)?
If a posting list contains 1 million doc_ids, how many skip pointers would be optimal?

Связанные уроки

ir-01 — Base IR model before diving into index structure
ir-03 — TF-IDF and BM25 are built on top of posting lists with term frequency
alg-12-bfs — Graph traversal via queue - the same idea as merging two sorted lists
ds-24-bloom-filter — Bloom filter - probabilistic structure for fast membership check in an index
aie-10-vector-databases — Vector DB - modern alternative: ANN instead of inverted index for semantic search
alg-26-two-pointers — Two-pointer technique - the exact mechanics of posting list intersection
ml-01-intro