Information Retrieval

TF-IDF and BM25

Elasticsearch holds an index of 50 billion documents and answers in 200 milliseconds. BM25 - an algorithm from 1994 - is still inside. Not because nobody had time to replace it. Because nothing matches it at that scale for speed. BERT processes one document in 10 ms. BM25 does it in microseconds. Understanding why this 30-year-old formula still wins means understanding the architecture of every modern search engine.

**Elasticsearch** and **Apache Solr** use BM25 as the default ranking algorithm for millions of applications
**Amazon Product Search** combines BM25 over title, description, and reviews to rank products
**GitHub Code Search** uses field weighting: a match in the filename weighs more than one in the content

Предварительные знания

Inverted Index

Robertson and the Birth of BM25

1994. Stephen Robertson and colleagues at City University London publish Okapi BM25 as part of the Okapi system for the TREC competition. BM stands for "Best Match" - a probabilistic ranking model. "25" is the formula's iteration number. Before BM25 came BM1, BM11, BM15 - each version patching problems in the previous one. TF saturation (parameter k1) and length normalization (parameter b) are the key wins over Karen Sparck Jones's 1972 TF-IDF. Thirty years later BM25 still ships as the default in Elasticsearch, Solr, and virtually every search system on the planet.

TF-IDF

Not all words pull equal weight in search. "The" shows up in every English document - it tells nothing about topic. But "quantum" is rare and instantly narrows the field. **TF-IDF** turns this intuition into numbers: a word's weight rises with its frequency in the document (TF) and its rarity across the collection (IDF).

In practice, **smoothed IDF** is the norm: `log(1 + N/df)` - dodges division by zero and tames extreme weights. TF often gets logarithmic scaling too: `1 + log(TF)`, because 10 occurrences of a word are not 10 times more useful than one.

Origins of TF-IDF

Karen Sparck Jones proposed the IDF concept in 1972 in the paper "A statistical interpretation of term specificity and its application in retrieval". She argued that high-specificity terms (rare in the collection) carry the most discriminative power. TF-IDF in its modern form took shape in Gerard Salton's work (SMART system, Cornell, 1960s-80s). Every search engine on the planet - from Elasticsearch to Google's first retrieval pass - still leans on this idea.

TF-IDF is not one algorithm but a whole **family** of variants. SMART notation captures combinations: `lnc.ltc` means log TF + no IDF + cosine norm for the document, log TF + IDF + cosine norm for the query. Dozens of variants exist.

A word appears in 10 out of 10,000 documents. It appears 5 times in one of those documents. What is true?

BM25

TF-IDF has problems: TF grows linearly (a document with a word 100 times gets 100x more weight than one with a single occurrence - wrong). **Okapi BM25** is a probabilistic ranking model that fixes these issues and has stayed the gold standard of search for 30 years.

The **k1** parameter controls TF saturation. At k1 = 0 the weight ignores frequency (binary model). As k1 → ∞ BM25 collapses back to linear TF-IDF. k1 = 1.2 means: the first few occurrences of a word matter, but the 100th barely moves the needle.

The **b** parameter controls document length normalization. b = 0 ignores length. b = 1 hammers long documents. b = 0.75 is the compromise: a long document can still be relevant, but length should not buy an unfair edge by simply containing more words.

TF in document	TF-IDF weight	BM25 weight (k1=1.2)
1	1.0 × IDF	1.1 × IDF
5	5.0 × IDF	1.9 × IDF
20	20.0 × IDF	2.1 × IDF
100	100.0 × IDF	2.2 × IDF
1000	1000.0 × IDF	2.2 × IDF

BM25 parameter k1 = 1.2. What happens if we increase it to 10?

Field Weighting

Real documents come structured: title, body, URL, meta description. A word in the **title** signals the document's topic far more loudly than the same word buried in the body. **Field weighting** assigns different weights to different document fields.

**BM25F** (BM25 with Field weighting) is a BM25 extension that computes a weighted sum of TF across fields. In Elasticsearch this is implemented via a **multi_match** query with boost parameters:

**Anchor text** - the text of hyperlinks pointing to a page - is a particularly valuable signal. If 1,000 pages link to a site with the text "best Python course", the site is probably exactly that. Google leans heavily on anchor text for ranking.

Field	Typical boost	Why
title	×5	The title directly describes the document's topic
anchor text	×4	What other sites say about the content
h1-h3	×3	Structural headings indicate key topics
URL	×3	URLs often contain keywords
meta description	×2	Short description for search engines
body	×1	Main text - base weight

Boost coefficients are empirical work. In practice they get tuned through **A/B tests** and **learning to rank**: a model trains on labeled data (query-document pairs with relevance scores) and learns optimal field weights.

Why does a word in the title get a higher weight than in the document body?

Document Length Normalization

A long document carries more words - and statistically more chances that any query word lands inside it. Without normalization, long documents would unfairly dominate the results. **Document length normalization** offsets this effect.

In BM25 the **b** parameter dials the strength of normalization. b = 1 hammers long documents. b = 0 ignores length entirely. Optimal b depends on the collection: homogeneous documents (scientific abstracts) call for lower b; heterogeneous (web) calls for higher.

Parameter b	Effect	When to use
b = 0	Length has no effect	All documents roughly the same length
b = 0.3	Weak normalization	Scientific papers (length varies little)
b = 0.75	Standard (default)	General case, web documents
b = 1.0	Maximum normalization	Very heterogeneous collection

An alternative: **cosine normalization** (divide the TF-IDF vector by its magnitude). Used in TF-IDF, not in BM25. Another approach - **pivoted document length normalization**: set a "pivot" point; documents shorter than it get a bonus, longer ones take a penalty.

Length normalization is just one piece. Modern systems also factor in **query length** (long queries need different handling), **coordination factor** (bonus for matching more query words), and **proximity** (query words clustering close together in a document is a good signal).

TF-IDF is obsolete with the arrival of BERT and neural networks - nobody uses it anymore

BM25 is still the first-stage retrieval method in virtually all search systems, including Google

Neural models (BERT, ColBERT) are used at the re-ranking stage - reordering the top 100 results. But the initial selection from billions of documents is still done by BM25: it works in milliseconds on huge indices, whereas processing one document with BERT takes ~10 ms. This is the retrieve-then-rerank architecture

Document A (200 words) and document B (2000 words) each contain the query word 4 times. With b = 0.75, avgdl = 500, which gets a higher BM25 score?

Key Ideas

**TF-IDF** = frequency in document × rarity in collection: rare words matter more than common ones
**BM25** solves TF-IDF problems: TF saturation (parameter k1) and length normalization (parameter b)
**Field weighting** (BM25F): a word in the title is ×5 more important than in the body ×1
**Retrieve-then-rerank** architecture: BM25 quickly selects top-1000, neural models rerank

Вопросы для размышления

Why does BM25 use a saturation constraint on TF instead of linear TF? In what cases would linear TF be better?
How should the k1 and b parameters be tuned for an online store search engine (products with short descriptions vs reviews)?
If BERT is so good at understanding meaning, why can't we simply replace BM25 with BERT for all 100B documents?

Связанные уроки

ir-02 — Inverted index stores TF and DF that BM25 needs
ir-01 — Core IR concepts: relevance and ranking
ir-04 — Vector space model and cosine similarity follow from TF-IDF
aie-13-advanced-rag — Hybrid search in RAG combines BM25 with dense retrieval
nlp-03 — NLP preprocessing: stemming and stopwords affect TF-IDF quality
rec-03 — TF-IDF and matrix factorization both extract signal from sparse data
stat-03-mle