Machine Learning
Search and Ranking
Google processes 8.5 billion search queries per day and returns relevant results in 0.2 seconds. The difference between showing the right result at position 1 versus position 10 is billions of dollars in revenue. Search ranking is the most scaled ML system in the world.
- **Google Search** - 8.5 billion queries per day are processed through a cascade of BM25 retrieval, bi-encoder filtering, and cross-encoder re-ranking, with each stage improving ranking quality by 5-15% NDCG over the previous one
- **YouTube recommendations** - the two-tower architecture selects top-1,000 videos from billions in 10ms, then a heavier ranking model with hundreds of features selects the final 20 videos for the homepage
- **E-commerce (Amazon, eBay)** - the search query 'red Nike running shoes' passes through BM25 (keyword filtering), semantic search (finds 'red Nike athletic footwear' even without the word 'running'), and personalized ranking (accounts for the user's purchase history and budget)
Предварительные знания
PageRank, BM25, and learning to rank
Search ranking has several intertwined roots. On the statistical side, Stephen Robertson and Steve Walker formalized BM25 around 1994, a probabilistic ranking function that scored documents by term frequency and inverse document frequency, and it remains a strong baseline in search engines today. On the link-analysis side, in 1998 Sergey Brin and Larry Page, then graduate students at Stanford, published PageRank, which ranked web pages by treating links as votes and weighting each vote by the importance of the linking page. PageRank became the foundation of Google. The third root is machine learning. In 2005 Chris Burges and colleagues at Microsoft Research introduced RankNet, a neural network trained on pairs of documents to predict which should rank higher, launching the modern learning to rank field. RankNet led to LambdaRank and LambdaMART, which optimized ranking metrics directly and powered commercial search engines for years.
Learning to Rank
The search ranking task: given a query and a set of documents, sort the documents by **relevance**. This is not a plain classification (relevant/not relevant) but **ordering** - document A must appear above document B. Learning to Rank (LTR) is a family of ML approaches that learn to rank from labeled data. Labels are typically assigned by expert human annotators on a scale: 0 (not relevant), 1 (partially relevant), 2 (relevant), 3 (perfect answer).
**NDCG (Normalized Discounted Cumulative Gain)** is the primary ranking metric. It accounts for both relevance and position: a document with score 3 at position 1 is more valuable than at position 10. Discounting is logarithmic: gain / log2(position + 1). NDCG is normalized from 0 to 1, where 1 is perfect ranking. **MAP (Mean Average Precision)** is an alternative metric for binary relevance (relevant/not), averaging precision at each position where a relevant document appears.
**LambdaMART - the industry standard:** LambdaMART = Lambda (gradients) + MART (Multiple Additive Regression Trees, i.e. gradient boosted trees). Idea: instead of directly optimizing NDCG (which is non-differentiable), LambdaMART computes **lambda-gradients** - an approximation of the NDCG gradient through pairwise comparisons. Each pair of documents (A, B) receives a gradient proportional to the change in NDCG if they were swapped. Implementations: - **XGBoost** with objective='rank:ndcg' or 'rank:pairwise' - **LightGBM** with objective='lambdarank' - **CatBoost** with loss_function='YetiRank' LambdaMART won virtually every ranking competition before 2020 and is still used in production at Google, Bing, and Yahoo.
**Features for LTR** fall into three groups: **query-dependent** (query length, query type), **document-dependent** (PageRank, document length, freshness, number of inbound links), and **query-document** (BM25 score, number of term matches, TF-IDF similarity). In practice, major search engines use 500 to 1000+ features per (query, document) pair. LambdaMART on gradient boosted trees handles this many heterogeneous features extremely well.
Why does the pairwise approach (LambdaMART) typically outperform pointwise for ranking tasks?
BM25: Classic Information Retrieval
**BM25 (Best Matching 25)** - an algorithm from 1994 that remains the foundation of information retrieval in 2024. Elasticsearch, Solr, Lucene - all use BM25 by default. The idea: document relevance to a query is determined by **how often the query terms appear in the document** (term frequency), weighted by **how rare the term is across all documents** (inverse document frequency). The word 'quantum' in a physics document is more valuable than the word 'is', because 'quantum' appears rarely.
**TF saturation** is the main difference between BM25 and plain TF-IDF. In TF-IDF, if the word 'Python' appears 100 times, the document receives 100 times more points than one with a single mention. But 100 mentions don't mean 100x more relevance - it could just be SEO spam. BM25 limits the impact of repetitions: beyond a certain threshold, additional occurrences barely add to the score. The parameter k1 controls the saturation point.
**Parameters k1 and b - when to change them:** **k1 (default 1.5):** - k1 = 0: TF is completely ignored, only the presence of the word matters - k1 = 1.2: fast saturation (good for short queries) - k1 = 2.0: slow saturation (good for long documents) **b (default 0.75):** - b = 0: document length is ignored (all documents are "equal") - b = 0.75: standard normalization (works in most cases) - b = 1.0: full normalization (penalizes long documents more strongly) **When to change:** - Web search: k1=1.2, b=0.75 (standard) - Academic papers (long, uniform): k1=1.5, b=0.3 - Tweets and headlines (short): k1=1.5, b=0.0 In practice, default parameters work well in 90% of cases.
**Why BM25 is still relevant?** First, speed: BM25 with an inverted index processes millions of documents in milliseconds. Second, interpretability: you know exactly why a document received a high score. Third, reliability: BM25 requires no training (just two parameters) and works out of the box in any language. Neural ranking models outperform BM25 in quality but are 100-1000x slower. That is why modern search systems use BM25 as the **first stage** (retrieval) and neural networks for **re-ranking** the top-K results.
What happens if you set the BM25 parameter b = 0?
Neural Ranking: BERT and Semantic Search
BM25 searches by exact word matches: the query 'how to treat a headache' will not find a document that says 'migraine therapy methods'. **Neural ranking** solves this by comparing the **meaning** of the query and document through neural network embeddings. The basic approach is a **cross-encoder**: take BERT, feed the pair [CLS] query [SEP] document [SEP] as input, and train an output layer to predict relevance. The cross-encoder sees all interactions between query and document words, giving the best quality.
The problem with cross-encoders: for each query, BERT must run through ALL candidates. With 1 million documents that's 1 million BERT forward passes - hours per query. The **bi-encoder** solves this: query and document are encoded **independently** into fixed-size vectors. Documents are encoded once during indexing. At search time only the query is encoded, and the nearest documents are found via similarity (dot product or cosine). But bi-encoders have lower quality: they don't see interactions between query and document words.
**ColBERT - the best of both worlds (late interaction):** ColBERT (Contextualized Late Interaction over BERT) is a compromise between cross-encoder and bi-encoder. Idea: each word in the query and document is encoded into a separate vector (not one vector per text). Relevance = the sum of maximum similarities between each query word and any document word: score = SUM_i MAX_j similarity(q_i, d_j) - Documents are encoded offline (like bi-encoder) - But word-level interaction is preserved (like cross-encoder) - Quality close to cross-encoder, speed close to bi-encoder - Downside: storage is heavier (one vector per word)
**Modern search system pipeline**: BM25 retrieval (top-1000) -> bi-encoder re-ranking (top-100) -> cross-encoder re-ranking (top-10) -> business logic (top-K for user). Each stage is more precise but slower. BM25 processes millions of documents in milliseconds; a cross-encoder takes ~10ms per (query, document) pair. This cascaded design combines BM25's speed with the quality of neural networks.
Why is a cross-encoder not used for searching across the entire document collection in production?
Two-Tower Architecture
The **two-tower** architecture is the industry standard for retrieval at scale. The idea: two separate neural networks ("towers") encode the query and document into vectors of the same dimensionality. Similarity is measured via dot product or cosine similarity. The key advantage: documents are encoded **once during indexing**, and at search time only the query needs to be encoded, with the nearest vectors found via similarity. This is the foundation of search at Google, YouTube, Facebook, Amazon, and virtually any large search system.
The critical component of a two-tower system is **Approximate Nearest Neighbors (ANN)**. Exact nearest-vector search among 1 billion items is O(N) - too slow. ANN indices find *approximately* the nearest neighbors in O(log N), trading a little accuracy (recall ~95-99%) for speed. The main libraries: **FAISS** (Facebook), **ScaNN** (Google), **Annoy** (Spotify), **HNSW** (hnswlib). FAISS can search among one billion vectors in 1-10 milliseconds on a single server.
**Hard negatives - the secret to quality two-tower training:** Training on random negative examples is insufficient: the model quickly learns to distinguish "cake recipe" from "quantum physics", but struggles to distinguish "chocolate cake recipe" from "vanilla cake recipe". **Hard negatives** are documents that resemble relevant ones but are not the correct answer: 1. **BM25 negatives:** top BM25 results not marked as relevant 2. **In-batch negatives:** other documents in the current batch (cheap) 3. **Self-mining:** documents the current model erroneously ranks highly Iterative process: train model v1 -> mine hard negatives with v1 -> train v2 on hard negatives -> repeat 2-3 times.
**Online serving** of a two-tower model in production: the document tower runs offline for all documents (billions), and the vectors are indexed in FAISS or ScaNN. The query tower runs online - at each request it encodes the query into a vector in ~1ms on GPU. ANN search finds the top-K nearest vectors in ~5ms. Total: from query to top-K results from one billion documents in ~10ms. These top-K are then re-ranked by a cross-encoder for the final order. This is the standard search architecture at Google, YouTube, Pinterest, and Airbnb.
BM25 is obsolete since neural ranking appeared and is no longer needed
BM25 is used as the first-stage retrieval step in the vast majority of search systems, and neural rankers only operate on the top-K results from BM25
Neural ranking models (cross-encoder, bi-encoder) produce better quality but are 100-1000x slower than BM25. Running BERT through one billion documents at each query is impossible with a 200ms latency requirement. So BM25 with an inverted index selects top-1000 candidates in milliseconds, and neural networks re-rank only those candidates. Google, Bing, Elasticsearch - all use BM25 as the first stage. Hybrid search (BM25 + neural) consistently outperforms either approach on its own.
Why do the query tower and document tower in a two-tower architecture have separate weights rather than sharing one model?
Key Takeaways
- **Learning to Rank:** LambdaMART on gradient boosted trees is the ranking standard, optimizing NDCG through pairwise document comparisons using hundreds of query-document features
- **BM25:** a 1994 algorithm that remains the backbone of search thanks to speed (millions of documents in milliseconds), TF saturation (spam protection), and document length normalization
- **Neural Ranking:** cross-encoder BERT achieves the best quality by giving full attention to the (query, document) pair; bi-encoder trades quality for speed; ColBERT is a compromise via late interaction
- **Two-Tower + ANN:** query tower and document tower encode texts into vectors, and FAISS/ScaNN search among billions in 10ms - this is exactly how the search that handles 8.5 billion queries per day in 0.2 seconds is built
Related Topics
Search & Ranking combines classic information retrieval, NLP, and scalable systems, bridging foundational algorithms with production ML:
- Recommendation Systems — The two-tower architecture originated in recommendations: the same idea of encoding user and item into vectors, the same ANN search for retrieval, the same cascaded ranking. The difference: in search the primary signal is text matching; in recommendations it is user behavior
- BERT and GPT — Cross-encoder and bi-encoder for ranking are built on BERT: the cross-encoder uses full attention between query and document; the bi-encoder uses the CLS token of each text independently. Understanding the transformer architecture is critical for neural ranking
- Gradient Boosting (XGBoost) — LambdaMART is gradient boosted trees with lambda-gradients for NDCG optimization. All XGBoost principles (regularization, missing value handling, split finding) apply directly to ranking models
- Word Embeddings — Bi-encoder produces sentence embeddings - an extension of the word2vec idea to sentences and documents. The same operations (cosine similarity, nearest neighbors) are applied to higher-level text representations
Вопросы для размышления
- BM25 was created in 1994 and neural ranking emerged in 2018-2019. Why is BM25 still used as the first stage in virtually every search system? Under what conditions could you do without BM25 entirely?
- Cross-encoder gives the best quality, but bi-encoder/two-tower gives the best speed. ColBERT is a compromise. How would you choose an architecture for a system with 10 million documents and a latency requirement of < 100ms?
- Hard negatives are critical for training a two-tower model. Why does training only on random negatives lead to poor quality, and how does iterative self-mining of hard negatives solve this problem?
Связанные уроки
- ml-51-recommendation-systems — Ranking and recsys share retrieval logic
- ml-37-bert-gpt — Semantic ranking uses transformer encoders
- ml-22-gradient-boosting — Learning-to-rank uses gradient boosting
- ml-35-word-embeddings — Query-document matching uses embeddings
- sd-21-search-systems — Same retrieval-then-rank system pattern
- stat-09-regression