Information Retrieval

Search Quality Evaluation

A search team deploys a new ranking algorithm. Offline evaluation: NDCG up 3%. A/B test a week later: users search more often, click less. What went wrong? Without understanding evaluation metrics, this question has no answer.

**Google:** assessors rate relevance on a 0-4 scale; NDCG is the primary offline metric
**Spotify:** MRR for playlist search (one specific playlist is the target)
**Amazon:** MAP for product search (users browse multiple relevant items)
**Bing:** CTR and dwell time, tracked in real time, for production quality monitoring

NDCG: Normalized Discounted Cumulative Gain

A search returns 10 results. How good is that list? Expert assessors can rate each document on a scale from 0 to 3 (0 = irrelevant, 3 = perfectly answers the query). **NDCG** measures how close the actual ranking is to the ideal one, while giving more weight to documents at higher positions.

Two steps: First, **DCG** sums relevance scores with a logarithmic position discount - rank 1 has weight 1/log2(2)=1.0, rank 2 has weight 1/log2(3)≈0.63, rank 5 has weight 1/log2(6)≈0.39. Second, normalize by **IDCG** - the DCG of the ideal ranking - to get a score in [0, 1].
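The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes linear gain (the raw grade divided by log2(rank + 1), as described above) and builds the ideal ranking only from the labels of the returned documents. The function names and the sample grades are hypothetical.

```python
import math

def dcg(relevances):
    """Sum of rel_i / log2(i + 1) over 1-based ranks i (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k for graded labels listed in the order the system ranked them."""
    actual = dcg(relevances[:k])
    # Ideal ranking: the same labels sorted from most to least relevant.
    # Assumption: only the returned documents are known, so the ideal
    # list is built from them rather than from a full judged pool.
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

ranked_labels = [3, 2, 0, 1, 2]  # hypothetical assessor grades for ranks 1..5
print(round(ndcg_at_k(ranked_labels, 5), 3))
```

Some evaluation setups use an exponential gain (2^rel - 1) instead of the raw grade; that variant rewards highly relevant documents more strongly and would change the numbers, not the structure of the computation.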

**NDCG vs Precision@k.** Precision@k treats relevance as binary: a document either counts or it does not. NDCG supports **graded relevance** - grades 0, 1, 2, 3. A document rated 3 should not be treated the same as one rated 1. For that reason NDCG tends to correlate better with measured user satisfaction than Precision@k.
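A small comparison may make the difference concrete. The sketch below scores two hypothetical rankings of the same three documents; it assumes that any grade above 0 counts as relevant for Precision@k and that NDCG uses linear gain. Both helper functions are illustrative, not taken from a specific library.

```python
import math

def precision_at_k(grades, k):
    # Binary view: any grade above 0 counts as relevant (an assumption).
    return sum(1 for g in grades[:k] if g > 0) / k

def ndcg_at_k(grades, k):
    def dcg(gs):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gs, start=1))
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0

best_first = [3, 1, 0]   # hypothetical grades: the grade-3 document is at rank 1
best_second = [1, 3, 0]  # same documents, but the grade-3 document is at rank 2

for name, grades in [("best_first", best_first), ("best_second", best_second)]:
    print(name, precision_at_k(grades, 3), round(ndcg_at_k(grades, 3), 3))
```

Precision@3 is identical for both rankings because each contains two relevant documents, while NDCG@3 is lower for the ranking that places the grade-1 document above the grade-3 one.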

MAP: Mean Average Precision

NDCG requires multi-grade relevance labels. MAP works with binary relevance: a document is either relevant (1) or not (0). MAP captures recall well: finding one relevant document is not enough - **all** relevant documents must be found.

**Average Precision (AP)** for a single query: compute precision at every rank where a relevant document appears, sum those values, and divide by the total number of relevant documents for that query. **MAP** is the mean AP across all queries in the test set.
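As a rough sketch of that definition, the code below computes AP from a ranked list of binary labels and then averages over queries. It assumes the total number of relevant documents per query is known from the judgments and passed in explicitly; the function names and the sample data are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant):
    """AP for one query; ranked_relevance holds 1/0 labels in ranked order."""
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Dividing by all relevant documents penalizes the ones never retrieved.
    return precision_sum / total_relevant

def mean_average_precision(queries):
    """queries: list of (ranked_relevance, total_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

queries = [
    ([1, 0, 1, 0, 0], 2),  # both relevant documents were retrieved
    ([0, 1, 0, 0, 1], 3),  # one of three relevant documents is missing
]
print(round(mean_average_precision(queries), 3))
```

In the second query the missing relevant document contributes nothing to the sum but still counts in the denominator, which is how AP rewards recall.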

**MAP vs NDCG in practice.** MAP suits recall-oriented systems where all relevant documents must be found (legal search, medical literature). NDCG suits precision-oriented systems where the quality of the top few results matters most and relevance comes in grades (web search, recommendations). Many production systems compute both.

AP divides by the total number of relevant documents, not just the number found. Why?

MRR: Mean Reciprocal Rank

Sometimes only one thing matters: the position of the first correct answer. FAQ search, direct answers, voice assistants - in these contexts a single accurate result at the top is the goal. **MRR** (Mean Reciprocal Rank) measures exactly that: at what rank does the first relevant document appear?

**RR** (Reciprocal Rank) for one query = 1 / rank_of_first_relevant. First position: RR = 1.0. Third position: RR = 0.333. Not found in top-K: RR = 0. **MRR** is the mean RR across all queries.
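The definition translates almost directly into code. The sketch below assumes binary labels in ranked order and a cutoff `k` (a hypothetical parameter name, defaulting to 10) beyond which a relevant document counts as not found.

```python
def reciprocal_rank(ranked_relevance, k=10):
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, is_relevant in enumerate(ranked_relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries, k=10):
    return sum(reciprocal_rank(q, k) for q in queries) / len(queries)

queries = [
    [1, 0, 0],  # first relevant at rank 1 -> RR = 1.0
    [0, 0, 1],  # first relevant at rank 3 -> RR = 1/3
    [0, 0, 0],  # nothing relevant         -> RR = 0.0
]
print(round(mean_reciprocal_rank(queries), 3))
```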

Metric	What it measures	When to use
NDCG	Full ranking quality with position weighting	Web search, recommendations
MAP	Precision + recall across the full list	Legal search, recall-critical tasks
MRR	Position of the first correct answer	FAQ, voice assistants, Q&A

**MRR limitations.** MRR considers only the first relevant document and ignores all subsequent ones. When a user needs several different answers, MAP or NDCG is more informative than MRR.

System A has MRR = 0.9, System B has MRR = 0.5. What does this mean in practice?

Online Metrics: CTR, Dwell Time, Abandonment

NDCG, MAP, and MRR are offline metrics: they require expert labels and do not show what real users actually do. **Online metrics** are collected automatically from user behavior in production and provide direct signal about system quality.

**CTR (Click-Through Rate)** is the fraction of queries that result in at least one click. High CTR means the results look attractive. But CTR is easy to game with clickbait titles - they increase CTR while harming the user experience.

**Dwell time** is the time spent on a page after a click. Short dwell (for example, under 30 s) suggests pogo-sticking: the user returned to the results almost immediately, so the document probably did not help. Long dwell (for example, over 2 min) suggests the document was useful. **Abandonment rate** is the fraction of sessions with no click and no reformulation - the user gave up.
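These definitions can be sketched against a simplified click log. The example below is an illustration under several assumptions: each record stands for one query that is also treated as its own session, the `QueryLog` structure and its field names are hypothetical, and the 30-second threshold is the rule of thumb mentioned above, not a universal constant.

```python
from dataclasses import dataclass, field

@dataclass
class QueryLog:
    # One dwell time in seconds per clicked result; empty if nothing was clicked.
    click_dwell_seconds: list = field(default_factory=list)
    reformulated: bool = False  # did the user issue a follow-up query?

def online_metrics(logs, short_dwell=30.0):
    if not logs:
        raise ValueError("no log records")
    n = len(logs)
    clicked = sum(1 for q in logs if q.click_dwell_seconds)
    abandoned = sum(1 for q in logs
                    if not q.click_dwell_seconds and not q.reformulated)
    dwells = [d for q in logs for d in q.click_dwell_seconds]
    short = sum(1 for d in dwells if d < short_dwell)
    return {
        "ctr": clicked / n,                    # queries with at least one click
        "abandonment_rate": abandoned / n,     # no click and no reformulation
        "short_click_share": short / len(dwells) if dwells else 0.0,
    }

logs = [
    QueryLog([12.0]),                # one short click
    QueryLog([150.0, 8.0]),          # one long click, one short click
    QueryLog([], reformulated=True), # no click, but the user tried again
    QueryLog([]),                    # no click, no reformulation
]
print(online_metrics(logs))
```

A production pipeline would normally group queries into sessions and handle clicks with no recorded return time; both are left out here to keep the sketch short.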

**A/B testing vs offline metrics.** A new ranking algorithm may improve NDCG in offline evaluation but reduce CTR in an A/B test. Possible reasons: assessors and users judge relevance differently, the offline test queries differ from live traffic, or the offline labels are outdated. The final deployment decision is usually based on online metrics.
