Information Retrieval
TF-IDF and BM25
Elasticsearch holds an index of 50 billion documents and answers in 200 milliseconds. BM25 - an algorithm from 1994 - is still inside. Not because nobody had time to replace it. Because nothing matches it at that scale for speed. BERT processes one document in 10 ms. BM25 does it in microseconds. Understanding why this 30-year-old formula still wins means understanding the architecture of every modern search engine.
- **Elasticsearch** and **Apache Solr** use BM25 as the default ranking algorithm for millions of applications
- **Amazon Product Search** combines BM25 over title, description, and reviews to rank products
- **GitHub Code Search** uses field weighting: a match in the filename weighs more than one in the content
Предварительные знания
Robertson and the Birth of BM25
1994. Stephen Robertson and colleagues at City University London publish Okapi BM25 as part of the Okapi system for the TREC competition. BM stands for "Best Match" - a probabilistic ranking model. "25" is the formula's iteration number. Before BM25 came BM1, BM11, BM15 - each version patching problems in the previous one. TF saturation (parameter k1) and length normalization (parameter b) are the key wins over Karen Sparck Jones's 1972 TF-IDF. Thirty years later BM25 still ships as the default in Elasticsearch, Solr, and virtually every search system on the planet.
TF-IDF
Not all words pull equal weight in search. "The" shows up in every English document - it tells nothing about topic. But "quantum" is rare and instantly narrows the field. **TF-IDF** turns this intuition into numbers: a word's weight rises with its frequency in the document (TF) and its rarity across the collection (IDF).
In practice, **smoothed IDF** is the norm: `log(1 + N/df)` - dodges division by zero and tames extreme weights. TF often gets logarithmic scaling too: `1 + log(TF)`, because 10 occurrences of a word are not 10 times more useful than one.
Origins of TF-IDF
Karen Sparck Jones proposed the IDF concept in 1972 in the paper "A statistical interpretation of term specificity and its application in retrieval". She argued that high-specificity terms (rare in the collection) carry the most discriminative power. TF-IDF in its modern form took shape in Gerard Salton's work (SMART system, Cornell, 1960s-80s). Every search engine on the planet - from Elasticsearch to Google's first retrieval pass - still leans on this idea.
TF-IDF is not one algorithm but a whole **family** of variants. SMART notation captures combinations: `lnc.ltc` means log TF + no IDF + cosine norm for the document, log TF + IDF + cosine norm for the query. Dozens of variants exist.
A word appears in 10 out of 10,000 documents. It appears 5 times in one of those documents. What is true?
BM25
TF-IDF has problems: TF grows linearly (a document with a word 100 times gets 100x more weight than one with a single occurrence - wrong). **Okapi BM25** is a probabilistic ranking model that fixes these issues and has stayed the gold standard of search for 30 years.
The **k1** parameter controls TF saturation. At k1 = 0 the weight ignores frequency (binary model). As k1 → ∞ BM25 collapses back to linear TF-IDF. k1 = 1.2 means: the first few occurrences of a word matter, but the 100th barely moves the needle.
The **b** parameter controls document length normalization. b = 0 ignores length. b = 1 hammers long documents. b = 0.75 is the compromise: a long document can still be relevant, but length should not buy an unfair edge by simply containing more words.
| TF in document | TF-IDF weight | BM25 weight (k1=1.2) |
|---|---|---|
| 1 | 1.0 × IDF | 1.1 × IDF |
| 5 | 5.0 × IDF | 1.9 × IDF |
| 20 | 20.0 × IDF | 2.1 × IDF |
| 100 | 100.0 × IDF | 2.2 × IDF |
| 1000 | 1000.0 × IDF | 2.2 × IDF |
BM25 parameter k1 = 1.2. What happens if we increase it to 10?
Field Weighting
Real documents come structured: title, body, URL, meta description. A word in the **title** signals the document's topic far more loudly than the same word buried in the body. **Field weighting** assigns different weights to different document fields.
**BM25F** (BM25 with Field weighting) is a BM25 extension that computes a weighted sum of TF across fields. In Elasticsearch this is implemented via a **multi_match** query with boost parameters:
**Anchor text** - the text of hyperlinks pointing to a page - is a particularly valuable signal. If 1,000 pages link to a site with the text "best Python course", the site is probably exactly that. Google leans heavily on anchor text for ranking.
| Field | Typical boost | Why |
|---|---|---|
| title | ×5 | The title directly describes the document's topic |
| anchor text | ×4 | What other sites say about the content |
| h1-h3 | ×3 | Structural headings indicate key topics |
| URL | ×3 | URLs often contain keywords |
| meta description | ×2 | Short description for search engines |
| body | ×1 | Main text - base weight |
Boost coefficients are empirical work. In practice they get tuned through **A/B tests** and **learning to rank**: a model trains on labeled data (query-document pairs with relevance scores) and learns optimal field weights.
Why does a word in the title get a higher weight than in the document body?
Document Length Normalization
A long document carries more words - and statistically more chances that any query word lands inside it. Without normalization, long documents would unfairly dominate the results. **Document length normalization** offsets this effect.
In BM25 the **b** parameter dials the strength of normalization. b = 1 hammers long documents. b = 0 ignores length entirely. Optimal b depends on the collection: homogeneous documents (scientific abstracts) call for lower b; heterogeneous (web) calls for higher.
| Parameter b | Effect | When to use |
|---|---|---|
| b = 0 | Length has no effect | All documents roughly the same length |
| b = 0.3 | Weak normalization | Scientific papers (length varies little) |
| b = 0.75 | Standard (default) | General case, web documents |
| b = 1.0 | Maximum normalization | Very heterogeneous collection |
An alternative: **cosine normalization** (divide the TF-IDF vector by its magnitude). Used in TF-IDF, not in BM25. Another approach - **pivoted document length normalization**: set a "pivot" point; documents shorter than it get a bonus, longer ones take a penalty.
Length normalization is just one piece. Modern systems also factor in **query length** (long queries need different handling), **coordination factor** (bonus for matching more query words), and **proximity** (query words clustering close together in a document is a good signal).
TF-IDF is obsolete with the arrival of BERT and neural networks - nobody uses it anymore
BM25 is still the first-stage retrieval method in virtually all search systems, including Google
Neural models (BERT, ColBERT) are used at the re-ranking stage - reordering the top 100 results. But the initial selection from billions of documents is still done by BM25: it works in milliseconds on huge indices, whereas processing one document with BERT takes ~10 ms. This is the retrieve-then-rerank architecture
Document A (200 words) and document B (2000 words) each contain the query word 4 times. With b = 0.75, avgdl = 500, which gets a higher BM25 score?
Key Ideas
- **TF-IDF** = frequency in document × rarity in collection: rare words matter more than common ones
- **BM25** solves TF-IDF problems: TF saturation (parameter k1) and length normalization (parameter b)
- **Field weighting** (BM25F): a word in the title is ×5 more important than in the body ×1
- **Retrieve-then-rerank** architecture: BM25 quickly selects top-1000, neural models rerank
Related Topics
TF-IDF and BM25 connect information retrieval with machine learning and NLP:
- Inverted Index — BM25 scores are computed from TF and DF stored in posting lists
- Introduction to Information Retrieval — BM25 is a concrete implementation of the ranking concept from the first lesson
Вопросы для размышления
- Why does BM25 use a saturation constraint on TF instead of linear TF? In what cases would linear TF be better?
- How should the k1 and b parameters be tuned for an online store search engine (products with short descriptions vs reviews)?
- If BERT is so good at understanding meaning, why can't we simply replace BM25 with BERT for all 100B documents?
Связанные уроки
- ir-02 — Inverted index stores TF and DF that BM25 needs
- ir-01 — Core IR concepts: relevance and ranking
- ir-04 — Vector space model and cosine similarity follow from TF-IDF
- aie-13-advanced-rag — Hybrid search in RAG combines BM25 with dense retrieval
- nlp-03 — NLP preprocessing: stemming and stopwords affect TF-IDF quality
- rec-03 — TF-IDF and matrix factorization both extract signal from sparse data
- stat-03-mle