Information Retrieval
IR in the Interview Room: How FAANG Thinks About Search
A senior IR interview at Google does not ask 'how does BM25 work'. It asks: 'Your system improved NDCG by 2% but CTR dropped 3%. What do you do?' That question tests whether search is understood as an algorithm or as a product.
- Google runs 10k+ A/B experiments per year on Search - each requires careful metric design to avoid false positives from position bias
- Bing's quality team found NDCG improvements below 1% are often noise - interleaving experiments give 10x more statistical sensitivity
- DoorDash rebuilt restaurant search ranking three times in two years as each optimized metric produced unintended side effects in production
Metrics: What NDCG Actually Measures
A Search Engineer interview at Google almost always opens the same way: 'How do you measure search quality?' The right answer does not start with the NDCG formula. It starts with a question: 'What counts as relevant for this product?' Different metrics optimize different properties.
| Metric | What it measures | Best fit |
|---|---|---|
| Precision@k | Fraction of top-k that are relevant | Advertising - only top positions matter |
| Recall@k | Fraction of all relevant docs in top-k | Medical/legal - missing a result is dangerous |
| MRR | Reciprocal rank of the first relevant result, averaged over queries | Navigational queries - user wants one exact page |
| MAP | Average precision per query (precision at each relevant hit), averaged over queries | Full ranker evaluation over a test corpus |
| NDCG@k | Graded relevance with positional discount | When documents differ in degree of relevance |
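To make the table concrete, here is a minimal sketch of NDCG@k and reciprocal rank for a single query. It assumes linear gain (the label value itself) and a log2 positional discount; some teams use an exponential gain of 2^label - 1 instead. The label list is hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded labels."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(gains):
    """1 / position of the first relevant (gain > 0) result, or 0 if none."""
    for position, g in enumerate(gains, start=1):
        if g > 0:
            return 1.0 / position
    return 0.0

# Hypothetical graded labels (0 = irrelevant, 3 = perfect), in ranked order.
ranking = [3, 0, 2, 1]
print(f"NDCG@4 = {ndcg_at_k(ranking, 4):.3f}, RR = {reciprocal_rank(ranking)}")
```

MRR is the mean of `reciprocal_rank` over a query set. It ignores everything after the first relevant hit, while NDCG rewards the ordering of every graded result in the top k.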
Interviewers always ask: 'Where do your relevance labels come from?' Options: human raters (expensive, slow, high quality), click-based labels (cheap, noisy, position-biased), interleaving experiments (A/B at SERP level - faster and more sensitive than standard A/B).
When is MRR preferable to NDCG?
System Design: Web Search from Scratch
'Design a web search system for a 10-billion-page corpus' - this is the canonical system design prompt at Google, Meta, and Microsoft. The key is not to produce a monolithic diagram, but to ask clarifying questions and build the system iteratively.
The Right Answer Structure
- **Clarify requirements first:** Web or enterprise search? Target latency? QPS? Personalization needed? Semantic or keyword-only?
- **Estimate scale:** 10B docs × 10 KB average = 100 TB raw. An inverted index is roughly 20-30% of that, so 20-30 TB. How many index servers are needed at 10 TB per machine?
- **Sketch components:** crawler -> doc store -> indexer -> inverted index -> query processor -> ranker -> SERP
- **Drill into bottlenecks:** index sharding strategy (document-partitioned vs term-partitioned), multi-stage ranking cascade, hot query caching
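The scale estimate above can be worked through in a few lines. This is a back-of-envelope sketch only: the function name and the replication factor are assumptions for illustration, and the 20-30% index ratio is taken from the estimate in the list.

```python
def index_servers(num_docs, avg_doc_bytes, index_pct, bytes_per_server, replicas=1):
    """Return (raw corpus bytes, index bytes, servers needed).

    Integer arithmetic throughout, to avoid floating-point rounding.
    `replicas` is a hypothetical replication factor, not a figure from the text.
    """
    raw = num_docs * avg_doc_bytes
    index = raw * index_pct // 100
    shards = -(-index // bytes_per_server)  # ceiling division
    return raw, index, shards * replicas

TB = 10**12
for pct in (20, 30):
    raw, index, servers = index_servers(10 * 10**9, 10 * 10**3, pct, 10 * TB)
    print(f"{pct}% index: raw={raw // TB} TB, index={index // TB} TB, servers={servers}")
```

One copy of the index fits on two or three such machines, so in this estimate the server count is driven by replication for QPS and availability rather than by raw storage.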
The interviewer will push on trade-offs: 'Why not run BERT directly on all 10B documents?' The answer: a BERT cross-encoder takes ~100ms per (query, document) pair. At 10B documents that is about 10^9 seconds, roughly 32 years of sequential compute for a single query, which is computationally infeasible under any realistic latency budget. Hence the cascade: a cheap first stage narrows the candidates, and an expensive second stage precisely ranks the small set.
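The arithmetic behind that answer, as a rough sketch. The 100 ms per pair comes from the text; the candidate count of 1,000 and the `parallelism` factor (batching or multiple accelerators) are illustrative assumptions.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def full_scan_seconds(num_docs, ms_per_pair):
    """Sequential time to score every document with the cross-encoder."""
    return num_docs * ms_per_pair / 1000

def rerank_seconds(num_candidates, ms_per_pair, parallelism=1):
    """Time to score only the first-stage candidates.

    `parallelism` is a hypothetical speed-up from batching or extra hardware.
    """
    return num_candidates * ms_per_pair / 1000 / parallelism

scan = full_scan_seconds(10**10, 100)
print(f"full corpus: {scan:.0f} s, about {scan / SECONDS_PER_YEAR:.1f} years per query")
print(f"1000 candidates, sequential: {rerank_seconds(1000, 100):.0f} s")
print(f"1000 candidates, 100x parallel: {rerank_seconds(1000, 100, 100):.0f} s")
```

Even 1,000 candidates take 100 s at the quoted per-pair cost if scored one at a time, so the second stage usually also needs batching, a smaller candidate set, or a cheaper model to fit a latency budget.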
Why can't a BERT cross-encoder be applied directly to 10 billion documents?
ML Design in Search: What Interviewers Test
Senior Search ML Engineer roles at FAANG blend IR and applied ML. The interview loop typically has: coding (standard LeetCode), ML design, system design, behavioral. The ML design section in search has specific patterns.
- **Query understanding:** intent classification (navigational/informational/transactional), named entity recognition in the query, spell correction at scale.
- **Candidate generation:** from 10B documents to 1000 candidates - BM25 vs ANN vs hybrid (RRF). Trade-off: recall vs latency vs infrastructure cost.
- **Learning to Rank:** given (query, document, relevance_label) triples, build a ranker. Which features? Pointwise vs pairwise vs listwise loss?
- **Evaluation design:** how to A/B test a new ranker without contaminating future training data? Interleaving, holdout sets, offline backtests with inverse propensity scoring (IPS).
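As an illustration of the IPS idea mentioned in the last bullet, the sketch below reweights logged clicks by the inverse of an assumed examination propensity per position. The propensity values and the click log are made up for the example; in practice propensities have to be estimated, for instance from randomization experiments.

```python
def ips_click_rate(impressions, propensity):
    """Position-debiased click rate per document.

    impressions: iterable of (doc_id, position, clicked) with clicked in {0, 1}.
    propensity:  assumed probability that a user examines each position.
    """
    totals, counts = {}, {}
    for doc_id, position, clicked in impressions:
        weight = 1.0 / propensity[position]  # up-weight clicks at rarely examined ranks
        totals[doc_id] = totals.get(doc_id, 0.0) + clicked * weight
        counts[doc_id] = counts.get(doc_id, 0) + 1
    return {doc_id: totals[doc_id] / counts[doc_id] for doc_id in totals}

# Hypothetical propensities and log: "a" was always shown at rank 1, "b" at rank 3.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}
log = [("a", 1, 1), ("a", 1, 0), ("b", 3, 1), ("b", 3, 0), ("b", 3, 0), ("b", 3, 0)]
estimates = ips_click_rate(log, propensity)
print(estimates)
```

Raw CTR favors "a" (0.5 vs 0.25), but after correcting for position "b" looks stronger. The cost is variance: small propensities produce large weights, which is why clipped or self-normalized variants are common.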
On ML design questions in search: always structure the answer as offline evaluation first (NDCG on labeled data), then online evaluation (A/B or interleaving), then guardrail metrics (p99 latency, index freshness). Interviewers expect exactly this structure.
What is Reciprocal Rank Fusion and why is it useful?
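A minimal sketch of Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over the ranked lists it appears in. The constant k = 60 is the commonly used default, and the document IDs and the two input lists are hypothetical.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_top = ["d1", "d2", "d3"]  # hypothetical lexical results
ann_top = ["d3", "d1", "d4"]   # hypothetical dense-retrieval results
fused = rrf([bm25_top, ann_top])
print(fused)
```

RRF uses only ranks, so BM25 scores and vector similarities never need to be put on a common scale, and nothing has to be trained. That makes it a convenient baseline for hybrid retrieval.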
Anti-Patterns: What Goes Wrong in Production IR
Search interviews at top companies test pattern recognition about failure modes, not just algorithm knowledge. Naming real anti-patterns signals production experience - the kind that comes from watching a 'good' model quietly make users unhappy for months.
- **Metric gaming:** optimize CTR aggressively and users start clicking clickbait. They don't return. Fix: guardrail metrics (session abandonment rate, long-click rate) that cannot regress.
- **Training-serving skew:** model trains on offline features, inference sees slightly different distributions (different tokenizer version, different preprocessing). Symptom: offline NDCG 0.85, online performs 2% worse than baseline.
- **Index drift:** documents update, index does not. Catastrophic for news search. Fix: tiered index - fresh tier for new/updated documents, main tier for the stable corpus.
- **Cold start:** new documents have no clicks. Content-based signals (domain authority, anchor text, PageRank) must carry the ranking until behavioral data accumulates.
- **Vocabulary mismatch:** user writes 'heart attack', document says 'myocardial infarction'. BM25 scores zero. Hybrid BM25 + semantic search is mandatory for medical and legal corpora.
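The vocabulary-mismatch case is easy to reproduce with a toy BM25 scorer. The sketch below uses whitespace tokenization, the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)), and typical default parameters k1 = 1.2 and b = 0.75. The two documents are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simple BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical overlap means no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "myocardial infarction treatment guidelines",
    "heart attack symptoms and first aid",
]
scores = bm25_scores("heart attack", docs)
print(scores)
```

The clinically equivalent first document scores exactly zero because it shares no terms with the query. A dense retriever or a synonym-expansion step is what recovers it.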
Myth: The best ranking model produces the best search experience. Improve NDCG offline and the system improves.
Reality: Search quality is a system property, not a model property. Index freshness, query understanding, latency, result diversity, entity resolution - any component can become the bottleneck. For many query types, offline NDCG correlates only weakly with online satisfaction.
Offline evaluation uses historical clicks which encode all the biases of the old system. If the old system ranked medical queries poorly, there are few 'correct' clicks on relevant medical documents in training data. A new model trained on these biased labels inherits the same blind spots.
Myth: A well-built IR system only needs to be designed once and then maintained without rethinking its architecture.
Reality: Any IR system degrades over time: data distributions shift, user behavior evolves, and anti-patterns accumulate in legacy code without regular audits.
Systems appear stable at first glance - metrics don't drop sharply, and slow degradation is invisible without A/B testing and systematic monitoring.
What is training-serving skew and how do you diagnose it?
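One common way to start diagnosing skew is to log the feature values the model actually sees at serving time and compare their distribution with the training data, feature by feature. The sketch below uses the Population Stability Index (PSI) as the comparison. It is an illustrative choice rather than a method prescribed by this lesson, and the sample data is synthetic.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between training and serving feature values.

    Bins are equal-width over the training range; serving values outside
    that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = histogram(expected), histogram(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train = [i / 100 for i in range(100)]  # synthetic training feature values
serving = [v + 0.3 for v in train]     # same feature, shifted at serving time
print(f"PSI identical: {psi(train, train):.3f}, PSI shifted: {psi(train, serving):.3f}")
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but thresholds should be calibrated per feature. A high value points to where to look (tokenizer version, preprocessing, stale lookups); it does not identify the cause by itself.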
IR in Industry Context
Information retrieval is one of the few domains where algorithms, ML, system design, and product thinking intersect on every interview question.
- System Design — Related topic
- ML Engineering — Related topic
- Product Thinking — Related topic
Summary
- NDCG is the standard metric with graded relevance; MRR for navigational, MAP for full corpus evaluation - metric choice depends on the product use case
- Two-stage ranking (BM25/ANN candidates -> BERT re-ranker) is the production standard; applying BERT to a full corpus is computationally infeasible
- RRF (Reciprocal Rank Fusion) is a training-free ensemble for combining lexical and dense retrieval - it uses only ranks plus a single smoothing constant k (commonly 60)
- Anti-patterns to name: metric gaming, training-serving skew, index drift, cold start, vocabulary mismatch - production experience shows in the ability to name failure modes
Questions for Reflection
- If forced to pick one single metric to monitor a production search system - one that absolutely cannot regress under any change - which would it be and what is the reasoning?
Related Lessons
- ir-19-personalization — Personalization is a mandatory topic for senior IR roles at FAANG
- ir-11 — Learning to Rank is among the first ML design questions for search engineer positions
- ir-10 — ANN vector search is a standard system design block in semantic search interviews
- ds-04-consistent-hashing — Sharding the inverted index via consistent hashing is a classic system design question