Information Retrieval
Search Quality Evaluation
A search team deploys a new ranking algorithm. Offline evaluation: NDCG up 3%. A/B test a week later: users search more often, click less. What went wrong? Without understanding evaluation metrics, this question has no answer.
- **Google:** assessors rate relevance on a 0-4 scale; NDCG is the primary offline metric
- **Spotify:** MRR for playlist search (one specific playlist is the target)
- **Amazon:** MAP for product search (users browse multiple relevant items)
- **Bing:** CTR and dwell time, tracked in real time, for production quality monitoring
NDCG: Normalized Discounted Cumulative Gain
A search returns 10 results. How good is that list? Expert assessors can rate each document on a scale from 0 to 3 (0 = irrelevant, 3 = perfectly answers the query). **NDCG** measures how close the actual ranking is to the ideal one, while giving more weight to documents at higher positions.
Two steps: First, **DCG** sums relevance scores with a logarithmic position discount - rank 1 has weight 1/log2(2)=1.0, rank 2 has weight 1/log2(3)≈0.63, rank 5 has weight 1/log2(6)≈0.39. Second, normalize by **IDCG** - the DCG of the ideal ranking - to get a score in [0, 1].
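The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample grades are hypothetical, the gain is the raw relevance grade (as described above, not the exponential 2^rel - 1 variant), and the ideal ranking is built by re-sorting the same result list rather than a full pool of judged documents.

```python
import math

def dcg_at_k(relevances, k):
    """Sum graded relevance scores, each discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    # Assumption: the ideal ranking is the same documents sorted by grade.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Hypothetical assessor grades (0-3) for the top 5 results, in ranked order.
grades = [3, 2, 0, 1, 2]
print(round(ndcg_at_k(grades, 5), 3))
```

Returning 0.0 when IDCG is zero is one possible convention for queries with no relevant documents; some evaluation setups skip such queries instead.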
**NDCG vs Precision@k.** Precision@k treats relevance as binary and ignores order within the top k. NDCG supports **graded relevance** - grades 0, 1, 2, 3 - so a document rated 3 is not treated the same as one rated 1. Because it accounts for both grade and position, NDCG generally tracks user satisfaction more closely than Precision@k.
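A small sketch can make the difference concrete. The two rankings below are hypothetical: they contain the same documents, but the grade-3 document sits at rank 1 in one and at rank 2 in the other. Precision@3 cannot tell them apart, while NDCG@3 can. The helper names are illustrative, and NDCG uses the raw grade as gain.

```python
import math

def precision_at_k(grades, k):
    """Fraction of the top-k results with any nonzero grade (binary view)."""
    return sum(1 for g in grades[:k] if g > 0) / k

def ndcg_at_k(grades, k):
    """NDCG with raw grades as gain and a log2(rank + 1) discount."""
    def dcg(gs):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gs[:k], start=1))
    idcg = dcg(sorted(grades, reverse=True))
    return dcg(grades) / idcg if idcg > 0 else 0.0

ranking_a = [3, 1, 0]  # best document first
ranking_b = [1, 3, 0]  # best document second

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    print(name, precision_at_k(ranking, 3), round(ndcg_at_k(ranking, 3), 3))
```

Both rankings get the same Precision@3 (two of three results are relevant), but only ranking A reaches an NDCG@3 of 1.0.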
MAP: Mean Average Precision
NDCG is designed for multi-grade relevance labels. MAP works with binary relevance: a document is either relevant (1) or not (0). MAP is also sensitive to recall: finding one relevant document is not enough - a system scores well only when **all** relevant documents are found and ranked high.
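**Average Precision (AP)** for a single query: compute precision at every rank where a relevant document appears, sum those values, and divide by the total number of relevant documents for the query, so relevant documents that were never retrieved contribute zero. **MAP** is the mean AP across all queries in the test set.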
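A minimal sketch of AP and MAP, assuming binary labels given in ranked order. It follows the convention stated in this section that AP divides by the total number of relevant documents for the query, not only those retrieved. The function names and the sample data are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant):
    """AP for one query: precision at each relevant rank, summed,
    then divided by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(queries):
    """MAP: mean AP over (ranked_relevance, total_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

# Hypothetical test set: two queries with binary labels in ranked order.
queries = [
    ([1, 0, 1, 0, 0], 3),  # 2 of 3 relevant documents were retrieved
    ([0, 1], 1),           # the only relevant document is at rank 2
]
print(round(mean_average_precision(queries), 3))
```

In the first query the third relevant document never appears in the list, so it adds nothing to the sum but still counts in the denominator. That is how missed documents lower AP.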
**MAP vs NDCG in practice.** MAP suits recall-oriented systems where all relevant documents must be found (legal search, medical literature). NDCG suits systems where the top of the ranking matters most and relevance comes in degrees (web search, Q&A). Many teams report both.
AP divides by the total number of relevant documents, not just the number found. Why?
MRR: Mean Reciprocal Rank
Sometimes only one thing matters: the position of the first correct answer. FAQ search, direct answers, voice assistants - in these contexts a single accurate result at the top is the goal. **MRR** (Mean Reciprocal Rank) measures exactly that: at what rank does the first relevant document appear?
**RR** (Reciprocal Rank) for one query = 1 / rank_of_first_relevant. First position: RR = 1.0. Third position: RR = 0.333. Not found in top-K: RR = 0. **MRR** is the mean RR across all queries.
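The definition above translates almost directly into code. The sketch below assumes binary labels in ranked order and a cutoff `k`; the function names, the default cutoff of 10, and the sample queries are illustrative.

```python
def reciprocal_rank(ranked_relevance, k=10):
    """1 / rank of the first relevant result in the top k, else 0."""
    for rank, is_relevant in enumerate(ranked_relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries, k=10):
    """MRR: mean reciprocal rank across all queries."""
    return sum(reciprocal_rank(q, k) for q in queries) / len(queries)

# Hypothetical queries: first hit at rank 1, at rank 3, and not found.
queries = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(round(mean_reciprocal_rank(queries), 3))
```

The three queries contribute 1.0, 1/3, and 0, so everything after the first relevant result has no effect on the score.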
| Metric | What it measures | When to use |
|---|---|---|
| NDCG | Full ranking quality with position weighting | Web search, recommendations |
| MAP | Precision + recall across the full list | Legal search, recall-critical tasks |
| MRR | Position of the first correct answer | FAQ, voice assistants, Q&A |
**MRR limitations.** MRR considers only the first relevant document and ignores all subsequent ones. When a user needs several different answers, MAP or NDCG is more informative than MRR.
System A has MRR = 0.9, System B has MRR = 0.5. What does this mean in practice?
Online Metrics: CTR, Dwell Time, Abandonment
NDCG, MAP, and MRR are offline metrics: they require expert labels and do not show what real users actually do. **Online metrics** are collected automatically from user behavior in production and provide a direct signal about system quality.
**CTR (Click-Through Rate)** is the fraction of queries that result in at least one click. High CTR means the results look attractive. But CTR is easy to game with clickbait titles - they increase CTR while harming the user experience.
**Dwell time** is the time spent on a page after a click. Short dwell (<30s) signals pogo-sticking: the user returned almost immediately because the document did not help. Long dwell (>2 min) suggests the document was useful. These thresholds are rules of thumb and vary by product and content type. **Abandonment rate** is the fraction of sessions with no click and no reformulation - the user gave up.
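As a rough sketch, these online metrics can be computed from a click log. Everything about the log format here is an assumption: the `QueryLog` record and its field names are hypothetical, each record stands for one query treated as its own session, and the 30-second threshold is the one quoted above.

```python
from dataclasses import dataclass, field

@dataclass
class QueryLog:
    """Hypothetical log record for one query; field names are illustrative."""
    dwell_seconds: list = field(default_factory=list)  # one entry per click
    reformulated: bool = False  # user issued a follow-up query

def ctr(logs):
    """Fraction of queries with at least one click."""
    return sum(1 for q in logs if q.dwell_seconds) / len(logs)

def abandonment_rate(logs):
    """Fraction of queries with no click and no reformulation."""
    abandoned = sum(1 for q in logs
                    if not q.dwell_seconds and not q.reformulated)
    return abandoned / len(logs)

def short_click_share(logs, threshold=30.0):
    """Share of clicks with dwell below the threshold (pogo-sticking)."""
    dwells = [d for q in logs for d in q.dwell_seconds]
    return sum(1 for d in dwells if d < threshold) / len(dwells) if dwells else 0.0

logs = [
    QueryLog([150.0]),              # one long, satisfied click
    QueryLog([8.0, 45.0]),          # a quick bounce, then a longer visit
    QueryLog([], reformulated=True),
    QueryLog([]),                   # no click, no reformulation
]
print(ctr(logs), abandonment_rate(logs), round(short_click_share(logs), 3))
```

A real pipeline would first group queries into sessions and filter out bot traffic. This sketch skips both steps.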
**A/B testing vs offline metrics.** A new ranking algorithm may improve NDCG in offline evaluation but reduce CTR in an A/B test. Possible reasons: assessors and users disagree about relevance, the offline test queries differ from live traffic, or the offline labels are outdated. The final deployment decision is typically based on online metrics.
Search Quality Evaluation
- NDCG: position-weighted gain with graded relevance, normalized against the ideal ranking
- MAP: mean Average Precision - captures both position quality and recall completeness
- MRR: mean of the reciprocal rank of the first relevant document - for Q&A and FAQ tasks
- CTR: fraction of queries with at least one click; high CTR does not guarantee quality
- Dwell time: time on page after click; <30s = pogo-sticking = document did not help
- Abandonment rate: no click and no reformulation = user gave up; <10% is healthy
Related Topics
Evaluation metrics are needed to measure the full IR pipeline - from indexing through ranking.
- Query Autocomplete and Suggest — Suggest quality is also measured with MRR and CTR
- Ranking and BM25 — Ranking algorithms whose quality NDCG and MAP measure
- Learning to Rank — ML approach to ranking that directly optimizes NDCG
Questions for Reflection
- Why does NDCG improvement in offline evaluation sometimes fail to predict CTR improvement in an A/B test?
- How does position bias distort CTR interpretation, and how is it corrected?
- In which products is a high abandonment rate acceptable or even expected?