Information Retrieval

Learning to Rank

2010. Yahoo! Learning to Rank Challenge: 473 teams competing for $30,000. Task: rank documents using 700 features. Winner: LambdaMART with NDCG@5 = 0.798. Second place: 0.003 behind. The Microsoft Research algorithm was immediately deployed to Bing and Yahoo! Search. This was the moment Machine Learning permanently displaced hand-tuned formulas from the core of production search.

**Yandex**: MatrixNet (LambdaMART variant) with thousands of features - the backbone of ranking since 2009
**LinkedIn**: Gradient Boosted Trees for job recommendation ranking - 30% CTR increase vs the previous model
**Airbnb**: LambdaMART for listing ranking - optimizing booking conversion rate

Historical context

In 2005 Chris Burges at Microsoft Research published 'Learning to Rank using Gradient Descent' - the paper introducing RankNet, the first successful neural approach to Learning to Rank. In 2006 Burges added the key LambdaRank idea: instead of the real gradient, use a 'lambda' - a scaled gradient where the scale equals |ΔNDCG|. Finally in 2010 LambdaMART combined lambda gradients with GBDT. The entire line of work went from academic curiosity to industry standard in 5 years.

Pointwise: ranking as regression

The pointwise approach treats ranking as regression or classification for each document independently. For a query q and document d, a scalar relevance score is predicted. Documents are sorted by descending predicted score. The problem: the model sees no other documents during training - there is no information about their relative order.

Ranking evaluation metrics: **NDCG@k** (Normalized Discounted Cumulative Gain) - weights relevance by position; **MAP** (Mean Average Precision) - for binary labels; **MRR** (Mean Reciprocal Rank) - position of the first relevant result. Pointwise optimizes MSE, not NDCG - hence the mismatch between the loss and the real metric.

Why does the pointwise approach to Learning to Rank produce worse NDCG results than pairwise/listwise?

Pairwise: ranking as pair comparison

The pairwise approach trains the model to predict the relative order of pairs of documents: P(doc_i > doc_j | query). For each query, pairs (doc_i, doc_j) are generated where i is more relevant. The classical algorithm is **RankNet** (Burges, 2005): a neural network with cross-entropy loss on pairs.

The number of pairs grows quadratically: for 100 documents in a list - up to 4,950 pairs. For production search engines with thousands of documents per query, this is unacceptable. Solutions: pair sampling or switching to listwise methods.

What is the main scaling problem with pairwise Learning to Rank on large document lists?

Listwise: direct NDCG optimization

Listwise approaches optimize the entire list at once. **ListNet** (Cao et al., 2007) minimizes the KL divergence between the predicted and ideal probability distributions of the top-1 document. **ListMLE** optimizes the log-likelihood of the correct ranking. Both methods work directly with list-level metrics, not pairs.

Listwise approaches are theoretically superior to pointwise and pairwise, but in practice the difference is small. LambdaMART is a hybrid: it uses pairwise lambda gradients to directly optimize NDCG, giving the best of both worlds.

What is the key advantage of listwise methods over pointwise and pairwise?

LambdaMART: GBDT for ranking

LambdaMART (Burges, 2010) is the most effective LtR algorithm for traditional feature-based systems. It combines MART (Multiple Additive Regression Trees, i.e., GBDT) with **lambda gradients**. The lambda gradient for a pair (i, j) accounts for the NDCG change from swapping documents: |ΔNDCG| * (pairwise RankNet gradient). The result: NDCG is optimized through an efficient GBDT framework.

LambdaMART won the Yahoo! Learning to Rank Challenge 2010. Used in Bing, Yahoo!, and Yandex as both baseline and production model. **lambdarank_truncation_level** is critical: considering only top-K positions when computing ΔNDCG speeds up training and focuses the model on relevant positions.

Neural LtR models always outperform LambdaMART in production search systems

LambdaMART is often competitive with or outperforms neural models with well-engineered features, especially with limited training data

Neural networks require large data volumes to generalize; LambdaMART is more efficient with tabular features and smaller datasets

How does LambdaMART combine pairwise and listwise approaches?

Key ideas

**Pointwise**: relevance score regression independently for each document - simple, but does not optimize order
**Pairwise** (RankNet): predicts order of document pairs - closer to the task, but O(n^2) pairs per query
**Listwise** (ListNet, ApproxNDCG): directly optimizes full-list metrics
**LambdaMART**: GBDT with lambda gradients = pairwise mechanism + listwise metric, industry standard

Вопросы для размышления

Why is the lambda gradient multiplied by |ΔNDCG| rather than using the plain RankNet gradient - what does this give the model?
How does position bias in click data affect LtR model quality, and how does IPS correction address it?
Under what conditions does neural LtR (DSSM, Neural Ranking) outperform LambdaMART in production search?

Связанные уроки

ir-03 — BM25 and classical IR metrics - the baseline that Learning to Rank improves on
ir-05 — Neural ranking builds on the LtR concepts established here
ml-22-gradient-boosting — LambdaMART extends standard GBDT with specialized lambda gradients for NDCG optimization
rec-04 — LtR in search is analogous to Two-Tower ranking in recommendations - both rank candidates by relevance
ml-52-search-ranking — LtR methods are the backbone of production search ranking in real search engines
ml-01