Information Retrieval

IR in the Interview Room: How FAANG Thinks About Search

A senior IR interview at Google does not ask 'how does BM25 work'. It asks: 'Your system improved NDCG by 2% but CTR dropped 3%. What do you do?' That question tests whether search is understood as an algorithm or as a product.

Google runs 10k+ A/B experiments per year on Search - each requires careful metric design to avoid false positives from position bias
Bing's quality team found NDCG improvements below 1% are often noise - interleaving experiments give 10x more statistical sensitivity
DoorDash rebuilt restaurant search ranking three times in two years as each optimized metric produced unintended side effects in production

Metrics: What NDCG Actually Measures

A Search Engineer interview at Google almost always opens the same way: 'How do you measure search quality?' The right answer does not start with the NDCG formula. It starts with a question: 'What counts as relevant for this product?' Different metrics optimize different properties.

Metric	What it measures	Best fit
Precision@k	Fraction of top-k that are relevant	Advertising - only top positions matter
Recall@k	Fraction of all relevant docs in top-k	Medical/legal - missing a result is dangerous
MRR	Reciprocal rank (1/rank) of the first relevant result, averaged over queries	Navigational queries - user wants one exact page
MAP	Precision averaged over the ranks of all relevant docs, then averaged over queries	Full ranker evaluation over a test corpus
NDCG@k	Graded relevance with positional discount	When documents differ in degree of relevance
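
To make the table concrete, here is a minimal sketch of three of these metrics for a single query. It assumes `rels` is the list of graded relevance labels in ranked order (0 = not relevant), that the exponential gain `2^rel - 1` is used (one common convention, not the only one), and that the ideal ranking is built from the labels of the retrieved list only. The function names are illustrative.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (label > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(rels, k):
    """Discounted cumulative gain: graded gain divided by log2(rank + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Labels for one ranked list: the best document (label 3) sits at rank 5.
labels = [0, 2, 1, 0, 3]
print(precision_at_k(labels, 3), reciprocal_rank(labels), ndcg_at_k(labels, 5))
```

MRR is the mean of `reciprocal_rank` over a query set. Note how `reciprocal_rank` ignores everything after the first hit, while `ndcg_at_k` penalizes the label-3 document for sitting at rank 5.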

Interviewers always ask: 'Where do your relevance labels come from?' Options: human raters (expensive, slow, high quality), click-based labels (cheap, noisy, position-biased), interleaving experiments (results from two rankers mixed into a single SERP, with each click credited to the ranker that supplied the result - faster and more sensitive than standard A/B).
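
Interleaving can be sketched with the team-draft scheme, one of several interleaving variants. The code below is a simplified illustration, not a production implementation: the function names, the `rng` parameter, and the click-credit rule (one point per clicked result) are assumptions made for the example.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng):
    """Mix two rankings; returns (interleaved list, doc -> team that supplied it)."""
    sources = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, team = [], {}
    while len(interleaved) < length:
        # The team with fewer picks goes first; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for side in order:
            # Each team contributes its highest-ranked document not yet shown.
            pick = next((d for d in sources[side] if d not in team), None)
            if pick is not None:
                interleaved.append(pick)
                team[pick] = side
                count[side] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team

def credit_clicks(clicked_docs, team):
    """Count clicks per team; the ranker with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

rng = random.Random(0)  # seeded only to make the example repeatable
mixed, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d1"], 4, rng)
print(mixed, credit_clicks(["d4"], team))
```

Because every user sees results from both rankers in the same list, the comparison is made within a single impression rather than across two user populations.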

When is MRR preferable to NDCG?

System Design: Web Search from Scratch

'Design a web search system for a 10-billion-page corpus' - this is the canonical system design prompt at Google, Meta, and Microsoft. The key is not to produce a monolithic diagram, but to ask clarifying questions and build the system iteratively.

The Right Answer Structure

**Clarify requirements first:** Web or enterprise search? Target latency? QPS? Personalization needed? Semantic or keyword-only?
**Estimate scale:** 10B docs × 10 KB average = 100 TB raw. The inverted index is 20-30% of that (20-30 TB). How many index servers are needed at 10 TB per machine?
**Sketch components:** crawler -> doc store -> indexer -> inverted index -> query processor -> ranker -> SERP
**Drill into bottlenecks:** index sharding strategy (document-partitioned vs term-partitioned), multi-stage ranking cascade, hot query caching
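
The scale estimate above is simple enough to check in a few lines. This is a storage-only back-of-envelope sketch: the function name and the replication factor of 3 are assumptions for illustration, and real sizing would also account for QPS, memory, and headroom.

```python
import math

def estimate_index_fleet(num_docs, avg_doc_bytes, index_pct, tb_per_machine, replicas=1):
    """Storage-only estimate of index shards and machines (decimal TB)."""
    raw_tb = num_docs * avg_doc_bytes / 1e12
    index_tb = raw_tb * index_pct / 100
    shards = math.ceil(index_tb / tb_per_machine)
    return {
        "raw_tb": raw_tb,
        "index_tb": index_tb,
        "shards": shards,
        "machines": shards * replicas,
    }

# Numbers from the prompt: 10B docs, 10 KB each, index at 20% and 30% of raw size.
for pct in (20, 30):
    # replicas=3 is an assumed replication factor, not a figure from the text.
    print(pct, estimate_index_fleet(10_000_000_000, 10_000, pct, 10, replicas=3))
```

On storage alone the index fits on 2 to 3 shards, which is a useful point to raise in the interview: the fleet size is driven by throughput and latency, not by disk.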

The interviewer will push on trade-offs: 'Why not run BERT directly on all 10B documents?' The answer: a BERT cross-encoder takes ~100 ms per (query, document) pair. At 10B documents that is about a billion seconds of compute for a single query, far beyond any latency budget. Hence the cascade: a cheap first stage narrows the candidates, and an expensive second stage precisely ranks the small set.
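
The arithmetic behind this answer, as a short sketch. It assumes pairs are scored strictly one at a time at the ~100 ms figure quoted above; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def sequential_scoring_seconds(num_pairs, ms_per_pair=100):
    """Total time to score every (query, document) pair one after another."""
    return num_pairs * ms_per_pair / 1000

full_corpus_s = sequential_scoring_seconds(10_000_000_000)  # all 10B documents
top_1000_s = sequential_scoring_seconds(1_000)              # a 1000-candidate set

print(f"full corpus: {full_corpus_s / SECONDS_PER_YEAR:.1f} years per query")
print(f"1000 candidates: {top_1000_s:.0f} s per query")
```

The cascade cuts the work by seven orders of magnitude. Under the same one-at-a-time assumption even 1000 candidates take 100 s, so the re-ranking stage presumably also relies on batching and parallel hardware, which the per-pair figure does not capture.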

Why can't a BERT cross-encoder be applied directly to 10 billion documents?

ML Design in Search: What Interviewers Test

Senior Search ML Engineer roles at FAANG blend IR and applied ML. The interview loop typically has: coding (standard LeetCode), ML design, system design, behavioral. The ML design section in search has specific patterns.

**Query understanding:** intent classification (navigational/informational/transactional), named entity recognition in the query, spell correction at scale.
**Candidate generation:** from 10B documents to 1000 candidates - BM25 vs ANN vs hybrid (RRF). Trade-off: recall vs latency vs infrastructure cost.
**Learning to Rank:** given (query, document, relevance_label) triples, build a ranker. Which features? Pointwise vs pairwise vs listwise loss?
**Evaluation design:** how to A/B test a new ranker without contaminating future training data? Interleaving, holdout sets, backtest with inverse propensity scoring (IPS).
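
The hybrid option named under candidate generation, Reciprocal Rank Fusion, fits in a few lines. This is a minimal sketch: each document's fused score is the sum of `1 / (k + rank)` over the input rankings. The constant `k = 60` is the commonly used default, and the document ids and tie-breaking rule are assumptions for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_top = ["d1", "d2", "d3"]  # hypothetical lexical results
ann_top = ["d3", "d1", "d4"]   # hypothetical dense-retrieval results
print(reciprocal_rank_fusion([bm25_top, ann_top]))
```

Only ranks are used, so the incompatible score scales of BM25 and vector similarity never have to be calibrated against each other. Documents found by both retrievers (`d1`, `d3`) rise above those found by only one.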

On ML design questions in search: always structure the answer as offline evaluation first (NDCG on labeled data), then online evaluation (A/B or interleaving), then guardrail metrics (p99 latency, index freshness). Interviewers expect exactly this structure.

What is Reciprocal Rank Fusion and why is it useful?

Anti-Patterns: What Goes Wrong in Production IR

Search interviews at top companies test pattern recognition about failure modes, not just algorithm knowledge. Naming real anti-patterns signals production experience - the kind that comes from watching a 'good' model quietly make users unhappy for months.

**Metric gaming:** optimize CTR aggressively and users start clicking clickbait. They don't return. Fix: guardrail metrics (session abandonment rate, long-click rate) that cannot regress.
**Training-serving skew:** model trains on offline features, inference sees slightly different distributions (different tokenizer version, different preprocessing). Symptom: offline NDCG 0.85, online performs 2% worse than baseline.
**Index drift:** documents update, index does not. Catastrophic for news search. Fix: tiered index - fresh tier for new/updated documents, main tier for the stable corpus.
**Cold start:** new documents have no clicks. Content-based signals (domain authority, anchor text, PageRank) must carry the ranking until behavioral data accumulates.
**Vocabulary mismatch:** user writes 'heart attack', document says 'myocardial infarction'. BM25 scores zero. Hybrid BM25 + semantic search is mandatory for medical and legal corpora.
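
The vocabulary-mismatch case is easy to reproduce. The sketch below is a toy BM25 scorer using whitespace tokenization and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`; the parameter defaults and the three-document corpus are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a minimal BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "acute myocardial infarction treatment guidelines",
    "heart attack symptoms and first aid",
    "seasonal allergy treatment",
]
print(bm25_scores("heart attack", docs))
```

The clinically relevant first document shares no tokens with the query, so its score is exactly zero, the same as the unrelated allergy document. A lexical ranker cannot tell them apart, which is the gap a semantic retriever is meant to close.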

Misconception: the best ranking model produces the best search experience. Improve NDCG offline and the system improves.

Reality: search quality is a system property, not a model property. Index freshness, query understanding, latency, result diversity, entity resolution - any component can become the bottleneck. Offline NDCG correlates only weakly with online satisfaction for many query types.

Offline evaluation uses historical clicks, which encode all the biases of the old system. If the old system ranked medical queries poorly, there are few 'correct' clicks on relevant medical documents in the training data. A new model trained on these biased labels inherits the same blind spots.

Misconception: a well-built IR system only needs to be designed once and can then be maintained without rethinking its architecture.

Reality: any IR system degrades over time. Data distributions shift, user behavior evolves, and anti-patterns accumulate in legacy code unless it is audited regularly.

Systems appear stable at first glance - metrics don't drop sharply, and slow degradation is invisible without A/B testing and systematic monitoring.

What is training-serving skew and how do you diagnose it?

IR in Industry Context

Information retrieval is one of the few domains where algorithms, ML, system design, and product thinking intersect on every interview question.


Summary

NDCG is the standard metric with graded relevance; MRR for navigational, MAP for full corpus evaluation - metric choice depends on the product use case
Two-stage ranking (BM25/ANN candidates -> BERT re-ranker) is the production standard; applying BERT to a full corpus is computationally infeasible
RRF (Reciprocal Rank Fusion) is a training-free ensemble for combining lexical and dense retrieval - it uses only ranks plus a single smoothing constant (commonly k = 60)
Anti-patterns to name: metric gaming, training-serving skew, index drift, cold start, vocabulary mismatch - production experience shows in the ability to name failure modes

Questions for Reflection

If forced to pick one single metric to monitor a production search system - one that absolutely cannot regress under any change - which would it be and what is the reasoning?

Related Lessons

ir-19-personalization — Personalization is a mandatory topic for senior IR roles at FAANG
ir-11 — Learning to Rank is among the first ML design questions for search engineer positions
ir-10 — ANN vector search is a standard system design block in semantic search interviews
ds-04-consistent-hashing — Sharding the inverted index via consistent hashing is a classic system design question