Recommender Systems
RecSys at the interview (FAANG)
2020. TikTok overtakes Instagram on time spent per user within two years: 95 min/day vs 53 min/day. The main secret is not prettier videos but a more accurate RecSys: the For You feed runs multi-objective ranking with carefully tuned engagement, satisfaction, and retention weights. ByteDance publishes the Monolith paper in 2022 - a description of its real-time recommendation system built around collisionless embedding tables. It becomes a reference for RecSys engineers. At Meta/ByteDance/Pinterest interviews they now probe exactly that depth: not just 'two-tower' but 'why did watch_time displace CTR in 2016 and what class of problems did it solve'. A staff ML engineer in this area gets USD 500K+ total compensation - and what the company pays for is precisely the ability to see this production depth.
- **Meta News Feed**: 1B+ users, two-tower retrieval (FAISS) + DLRM ranking over thousands of features, multi-objective with weights for clicks/comments/likes/shares
- **TikTok For You**: ByteDance Monolith (real-time training and serving with collisionless embedding tables), real-time user behaviour signals, per-user model fine-tuning
- **Spotify Discover Weekly**: collaborative filtering + audio embeddings + multi-objective (relevance + novelty + diversity)
- **Pinterest Visual Search**: CLIP-style multimodal embeddings, hybrid retrieval, billions of pins indexed
Prerequisites
- Two-stage retrieval: candidate generation + ranking
- Netflix / YouTube / TikTok architectures
- RecSys metrics: CTR, watch-time, NDCG, retention
Two-Stage Retrieval as the Canonical ML System Design Answer
The FAANG RecSys interview discipline formed around one answer template: two-stage retrieval, candidate generation first, then ranking. The template took hold after Google's "Deep Neural Networks for YouTube Recommendations" (Covington et al. 2016) demonstrated the decomposition at industrial scale. Since then a candidate in an ML system design loop is expected to open with the retrieval and ranking split, discuss metric choice (CTR vs watch-time vs retention), and state the tradeoffs honestly: latency budget, cold start, feedback loop. The interviewer scores not knowledge of a specific model but the ability to structure the problem and defend each decision with a business signal.
Design Feed: Instagram, TikTok, Twitter
One of the most common questions at Meta/ByteDance/X: *"Design the Instagram/TikTok/Twitter feed."* This is ML system design, and the approach follows the generic Grokking-style system design framework, only with RecSys specifics. **Step 1: clarify scope**. How many users (1B like Instagram or 100M like Pinterest)? What content type (video, photo, text)? What target metric - engagement (time spent), revenue (clicks), satisfaction (long-term retention)? Without that, the answer is pure handwaving. **Step 2: latency budget**. The feed loads in 200-500 ms on mobile, which leaves 100-300 ms for all server-side processing (network + render eat the rest). That immediately dictates the architecture: you **cannot** score 10M items with a heavy model in real time, so you need a two-stage pipeline: candidate generation (1K-10K items) + ranking (top 100).
The canonical 'two-tower' answer for a feed (used by Meta, YouTube, Pinterest). **Candidate generation**: a few thousand candidates from a 10M item pool via ANN (Annoy, FAISS, ScaNN), selected by the dot product user_embedding @ item_embedding. This stage is cheap: on the order of **400 µs per query** for a FAISS HNSW index with 10M items at 256-dim embeddings. **Ranking**: 1000 candidates through a heavy model (a DNN or gradient boosting over thousands of features: user history, item content, contextual signals). Ranking latency is 50-150 ms, so it dominates the server-side budget, and the two stages together must fit inside it. Search engines (Google, Bing) follow the same logic - they are also two-stage: BM25/recall stage + neural reranker. At a Meta interview interviewers love it when the candidate proposes two-tower before being prompted.
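The two stages can be sketched in a few lines. This is a minimal illustration, not a production design: exact dot-product search stands in for an ANN index such as FAISS, a linear scorer stands in for the heavy ranking model, and all sizes, names, and random data are assumptions made for the example.

```python
import numpy as np

def retrieve_candidates(user_emb, item_embs, k):
    """Stage 1: top-k items by dot product (exact search stands in for ANN)."""
    scores = item_embs @ user_emb
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]        # order the k survivors by score

def rank_candidates(candidate_ids, ranking_features, weights, top_n):
    """Stage 2: score only the retrieved candidates with a (toy) heavier model."""
    logits = ranking_features[candidate_ids] @ weights
    order = np.argsort(-logits)[:top_n]
    return candidate_ids[order]

# Hypothetical data: 100K items instead of 10M to keep the sketch light.
rng = np.random.default_rng(0)
n_items, dim, n_feats = 100_000, 64, 32
item_embs = rng.standard_normal((n_items, dim)).astype(np.float32)
user_emb = rng.standard_normal(dim).astype(np.float32)
ranking_features = rng.standard_normal((n_items, n_feats)).astype(np.float32)
weights = rng.standard_normal(n_feats).astype(np.float32)

candidates = retrieve_candidates(user_emb, item_embs, k=1000)
feed = rank_candidates(candidates, ranking_features, weights, top_n=100)
print(len(candidates), len(feed))
```

The point of the split is that the expensive scorer touches 1000 rows, not the whole catalogue; swapping the exact search for an ANN index changes only stage 1.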
Staff-interview trap: *"How do you avoid filter bubbles and cold start at the same time?"* This is the **explore-exploit** problem, and a good answer mentions several layers. **Randomisation layer**: inject 5-10% random/diverse candidates into the feed to collect feedback on new items. **Contextual bandit**: Thompson sampling to choose between 'safe' (what the user will surely like) and 'exploratory' (what might expand interests) - used by Spotify Discover Weekly. **Diversity penalty**: in re-ranking, penalise items that are too similar (via DPP - Determinantal Point Process). Cold start is handled separately: for new users - demographic fallbacks; for new items - a content-based boost in the first 48 hours, then transition to the collaborative signal. The same tension in training ML models: explore (new architectures) vs exploit (optimise the known) - a universal pattern.
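The Thompson sampling idea can be shown with a deliberately simplified sketch. It is a plain (non-contextual) Beta-Bernoulli bandit with two arms, 'safe' and 'exploratory'; a contextual version would condition on user features. The arm names, the simulated success rates, and the round count are assumptions for illustration only.

```python
import random

class BetaBernoulliArm:
    """One arm with a Beta posterior over its success probability."""
    def __init__(self, name):
        self.name = name
        self.successes = 1  # Beta(1, 1) prior, i.e. uniform
        self.failures = 1

    def sample(self, rng):
        # Draw a plausible success rate from the current posterior.
        return rng.betavariate(self.successes, self.failures)

    def update(self, reward):
        if reward:
            self.successes += 1
        else:
            self.failures += 1

def choose_arm(arms, rng):
    """Thompson sampling: play the arm whose posterior draw is highest."""
    return max(arms, key=lambda arm: arm.sample(rng))

rng = random.Random(42)
true_rates = {"safe": 0.30, "exploratory": 0.45}  # hidden, simulated only
arms = [BetaBernoulliArm(name) for name in true_rates]
pulls = {name: 0 for name in true_rates}

for _ in range(5000):
    arm = choose_arm(arms, rng)
    reward = rng.random() < true_rates[arm.name]  # simulated user feedback
    arm.update(reward)
    pulls[arm.name] += 1

print(pulls)
```

Early on both arms are tried because their posteriors are wide; as evidence accumulates, traffic shifts to the arm that actually performs better, which is the explore-exploit balance the answer should describe.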
At a Meta interview: *"A user has 10M potential items in their feed, latency budget 200 ms. How do you structure the pipeline?"*
Design Search: Spotify, Amazon, Pinterest
The question *"Design Amazon/Spotify/Pinterest search"* is a typical staff round at those companies. The main difference from a feed: search is **driven by intent** (user explicitly issued a query), feed is **driven by interest** (passive content delivery). That changes the architecture. **Stage 1 - retrieval** must take query relevance into account: traditional BM25 (text match) + dense retrieval (query embedding close to item embedding) - hybrid score. **Stage 2 - ranking** additionally takes personalization into account: the same item gets different scores for two users due to different user history. That is hybrid search, and its architecture grows more complex: the retrieval pipeline now needs a **lexical index** (Elasticsearch, Lucene) plus a **vector index** (FAISS) with fusion - usually through Reciprocal Rank Fusion.
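Reciprocal Rank Fusion is simple enough to write out. The sketch below assumes each retriever returns an ordered list of document ids; the ids and the two example rankings are made up, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

lexical = ["a", "b", "c", "d"]   # e.g. BM25 order (hypothetical)
dense = ["c", "a", "e", "b"]     # e.g. vector-index order (hypothetical)
fused = reciprocal_rank_fusion([lexical, dense])
print(fused)  # ['a', 'c', 'b', 'e', 'd']
```

RRF uses only ranks, not raw scores, which is why it is a common choice for fusing BM25 and dense scores that live on incomparable scales.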
A subtle point - **query understanding**. Amazon interviewers often catch candidates with the simplified answer 'just embed the query through BERT'. The right answer is multi-layered. **Layer 1**: query normalization - lowercase, spell correction ("adidash shoes" → "adidas shoes"), synonyms ("sneakers" ↔ "shoes"). **Layer 2**: intent classification - 'navigational' (user knows what they want: 'macbook pro 16'), 'informational' ('how to cook salmon'), 'transactional' ('buy iphone'); different intents require different ranking strategies. **Layer 3**: query rewriting - LLM-based expansion, used by Pinterest and Spotify in particular. **Layer 4**: personalized query: the same 'shoes' from a runner and from a fashion enthusiast are different queries. Knowing this hierarchy separates an ML engineer from a research scientist - the latter can do one thing, the former knows all the layers.
A favourite Pinterest trap: *"What if the query is an image, not text?"* Here the architecture moves to **multimodal retrieval**. CLIP (OpenAI) and its variants give a unified embedding space for image+text: you index images via the CLIP image encoder, then search with either the CLIP text encoder (for a text query) or the CLIP image encoder (for visual similarity). Pinterest's stack is exactly this: 'Visual Search' - upload an image, find similar ones. The old-school alternative: extract features through ResNet, index in FAISS, train a classifier for tags. CLIP wins because its zero-shot generalisation works on out-of-distribution images. Knowing this transition between approaches is a strong signal at a staff interview.
What is the principal architectural difference between search and feed (beyond the obvious 'there is a query')?
Metrics: what to measure and how to defend the choice
The most undervalued question at a staff interview: *"Which metric do you optimise in this design?"* Candidates often answer 'CTR' and fall into the trap - the interviewer wants a deep answer. CTR can be pushed up by sensational content (clickbait), which kills long-term retention. So modern RecSys works with **multi-objective optimization**: a weighted sum of multiple targets - p(click), p(watch_time>30s), p(satisfaction_rating), p(return_next_day). Weights are tuned for the business objective - and those weights become an ML/business interview question. **Canonical example**: YouTube in 2012 optimised CTR, got a clickbait explosion; in 2016 switched to watch_time, got longer videos but reduced satisfaction; in 2019 introduced satisfaction surveys as the ground-truth target. That metric evolution is the history of every YouTube algorithm change.
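A weighted-sum ranking score can be sketched as follows. The objective names mirror the targets listed above, while the weights and the two example items are invented for illustration; in practice the weights come from business tuning and A/B tests.

```python
def blended_score(predictions, weights):
    """Multi-objective score: weighted sum of per-objective probabilities."""
    return sum(weights[name] * predictions[name] for name in weights)

# Hypothetical weights: long-term signals count more than a raw click.
weights = {
    "p_click": 1.0,
    "p_watch_30s": 2.0,
    "p_satisfied": 4.0,
    "p_return_next_day": 8.0,
}

clickbait = {"p_click": 0.30, "p_watch_30s": 0.05,
             "p_satisfied": 0.02, "p_return_next_day": 0.10}
solid = {"p_click": 0.12, "p_watch_30s": 0.09,
         "p_satisfied": 0.06, "p_return_next_day": 0.15}

for name, item in [("clickbait", clickbait), ("solid", solid)]:
    print(name, round(blended_score(item, weights), 2))
```

With these assumed numbers the clickbait item wins on CTR alone (0.30 vs 0.12) but loses on the blended score (about 1.28 vs 1.74), which is the behaviour the multi-objective setup is meant to produce.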
Distinguish **online from offline metrics**. **Offline** (on held-out data): precision@k, recall@k, NDCG, MAP, AUC - cheap to iterate on but they do not reflect behaviour change. **Online** (A/B tests): CTR, watch_time, retention, revenue per user - the final source of truth, but expensive (you need 2-4 weeks and millions of users in the experiment). The correlation between offline and online metrics is NOT always good - the classic Meta lesson from 2018. **Counterfactual evaluation** (Inverse Propensity Scoring, Off-Policy Evaluation) tries to estimate online results from offline data but has limits. At staff/principal interviews they ask about your experience untangling offline/online divergence - it is a deep problem, and the answer shows engineering maturity.
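Two of the quantities named above are short enough to write down. The sketch shows NDCG@k (linear-gain variant) as an offline metric and a basic Inverse Propensity Scoring estimate as a counterfactual one. The log format `(context, action, reward, propensity)` and the toy policy are assumptions for the example; real IPS pipelines usually add clipping or self-normalisation to control variance.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def ips_estimate(logs, target_policy_prob):
    """Estimate the target policy's mean reward from logged data.

    Each log entry is (context, action, reward, logging_propensity).
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        total += reward * target_policy_prob(context, action) / propensity
    return total / len(logs)

print(ndcg_at_k([3, 2, 1], k=3))            # ideal order
print(round(ndcg_at_k([1, 2, 3], k=3), 3))  # reversed order scores lower

# Hypothetical logs from a uniform-random logging policy over two actions.
logs = [("u1", "a", 1, 0.5), ("u1", "b", 0, 0.5),
        ("u2", "a", 0, 0.5), ("u2", "b", 1, 0.5)]
always_a = lambda context, action: 1.0 if action == "a" else 0.0
print(ips_estimate(logs, always_a))
```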
At principal/distinguished interviews a common question is **proxy metrics**. Watch time as a proxy for satisfaction - okay? No: a long video might mean 'engaged' or 'bored-scrolling'. Advice: **triangulate** across multiple proxy metrics with different failure modes. The ML analogy: validation loss is a proxy for production quality. A candidate who can say *"our proxies may drift from the true objective; here is the monitoring and validation that catches drift"* is strong for a staff role. Same logic as Goodhart's Law: 'When a measure becomes a target, it ceases to be a good measure' - and RecSys engineers learned this through YouTube CTR maximisation in 2012.
A team wants to improve feed engagement. What is a mature interview answer about metric choice?
System design: production-grade and tradeoffs
At the principal level they ask about **production realities** that courses do not cover. **Online vs offline serving**. How are user embeddings updated? A daily batch job (Spark) is cheap but carries 24-hour staleness: user behaviour from the past day is not reflected. Real-time updates - streaming through Kafka into a feature store (Redis or a TikTok-style Monolith service) - are expensive but bring freshness under 1 second. The middle ground is hybrid: stable features by batch, fast features (last_10_items_watched) by stream. **Item embeddings** are usually refreshed by an hourly batch because they are stable by nature (item content does not change fast). **Negative sampling** - often missed but critical: the model needs negatives, and each simple strategy has a flaw. Uniform sampling over the catalogue yields mostly easy, uninformative negatives; in-batch sampling introduces popularity bias, because popular items show up as negatives far more often. A candidate who proposes 'mixed negative sampling: 50% in-batch + 50% uniform' on their own sends a clear production-experience signal.
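The mixed strategy can be sketched in plain Python. This is an illustration under simplifying assumptions: items are integer ids in `range(catalog_size)`, the function name and its parameters are hypothetical, and a real trainer would do this in vectorised form inside the input pipeline.

```python
import random

def mixed_negatives(batch_item_ids, positive_id, catalog_size,
                    n_negatives, in_batch_frac=0.5, rng=random):
    """Draw negatives for one positive: part in-batch, part uniform."""
    n_in_batch = int(n_negatives * in_batch_frac)

    # In-batch negatives: other users' positives from the same batch.
    # They follow item popularity, so they are harder but popularity-biased.
    in_batch_pool = [i for i in batch_item_ids if i != positive_id]
    in_batch = rng.sample(in_batch_pool, min(n_in_batch, len(in_batch_pool)))

    # Uniform negatives: drawn from the whole catalogue, covering the tail.
    uniform = []
    while len(in_batch) + len(uniform) < n_negatives:
        candidate = rng.randrange(catalog_size)
        if candidate != positive_id:
            uniform.append(candidate)

    return in_batch + uniform

rng = random.Random(0)
batch = [1, 2, 3, 4, 5, 6, 7, 8]
negatives = mixed_negatives(batch, positive_id=3, catalog_size=1_000_000,
                            n_negatives=8, rng=rng)
print(negatives)
```

The 50/50 split is exposed as `in_batch_frac` so that the ratio can be treated as a tunable hyperparameter instead of a fixed rule.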
**Model serving latency** - a frequent trap. Suppose your ranking DNN is 100M parameters, BERT-style. CPU inference - 200-500 ms per query (1000 candidates). That kills the latency budget. Three solutions. **Solution 1**: smaller model (distill via knowledge distillation into a 10x smaller model, quality loss <1% if done right). **Solution 2**: GPU serving (TensorRT, ONNX Runtime) - 10-20 ms per query, but USD 2-5 per million queries. **Solution 3**: caching: for top-1% queries (top_searches) precompute ranking results - latency win but bias toward head queries. In reality Meta and YouTube use all three at once: distilled models on CPU for the tail, GPU for high-revenue traffic, cache for the head. Knowing this traffic stratification is a strong signal.
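For the distillation option, the core of the training signal is a loss that pulls the student's softened output distribution toward the teacher's. The sketch below is one common formulation (KL divergence on temperature-softened softmax outputs, scaled by T²); it assumes each row holds logits over a slate of candidates, which is an illustrative choice and not the only way to distill a ranking model.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

teacher = np.array([[2.0, 1.0, 0.1]])   # hypothetical teacher logits
aligned = np.array([[2.0, 1.0, 0.1]])   # student that matches the teacher
reversed_ = np.array([[0.1, 1.0, 2.0]]) # student with the opposite ordering

print(distillation_loss(aligned, teacher))
print(distillation_loss(reversed_, teacher) > 0)
```

In training this term is typically combined with the ordinary supervised loss on logged labels, so the small model learns from both the teacher and the data.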
The final maturity signal is proactively discussing **failure modes**. What happens if the FAISS index goes down? Fall back to a popularity-based feed (top items by recent engagement) - degraded UX but not an empty screen. What if the ranking model takes 5 seconds? Enforce a 200 ms deadline and, on timeout, use candidate generation scores as a fallback. What about cold start (new user)? Demographic-based defaults plus diversity-heavy exploration. New item? Content-based boost for 48 hours, then transition to the collaborative signal. A candidate who PROACTIVELY discusses these scenarios without the interviewer pushing is a strong staff/principal signal. Same intuition as SRE: 'design for failure' - a typical mantra of the Netflix Chaos Monkey culture.
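The ranking-timeout fallback can be sketched with the standard library. The function and ranker names here are hypothetical, and the sketch handles only the timeout case; a production service would also fall back on ranker errors and would enforce the deadline at the RPC layer as well.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool so a slow ranker call does not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)

def rank_with_deadline(candidates, retrieval_scores, ranker,
                       deadline_s=0.2, top_n=3):
    """Try the heavy ranker; on timeout, order by retrieval scores instead."""
    future = _executor.submit(ranker, candidates)
    try:
        ranked = future.result(timeout=deadline_s)
        source = "ranker"
    except FutureTimeout:
        future.cancel()  # best effort; a running call is not interrupted
        ranked = sorted(candidates, key=lambda c: -retrieval_scores[c])
        source = "retrieval_fallback"
    return ranked[:top_n], source

def slow_ranker(candidates):
    time.sleep(0.3)  # simulates a ranking model that misses the deadline
    return list(reversed(candidates))

def fast_ranker(candidates):
    return list(reversed(candidates))

scores = {"a": 0.1, "b": 0.9, "c": 0.5}
print(rank_with_deadline(["a", "b", "c"], scores, slow_ranker, deadline_s=0.05))
print(rank_with_deadline(["a", "b", "c"], scores, fast_ranker, deadline_s=1.0))
```

The user still gets a feed ordered by stage-1 scores when the ranker is late, which is the degraded-but-not-empty behaviour described above.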
A common misconception: RecSys system design is just applying a Grokking-style framework: requirements → API → high-level → data → scale.
RecSys system design has specific axes the generic Grokking framework lacks. **Multi-objective optimization** (not one metric but a weighted sum). **Two-stage architecture** is mandatory (candidate gen + ranking). **Online vs offline serving** for different feature classes. **Negative sampling** strategies. **Cold start** for users and items as a first-class problem, not an edge case. **Explore-exploit** subtly drives retention. **Multimodal retrieval** (CLIP) for image search. At a Meta/Google/Pinterest interview these RecSys-specific topics are exactly what is tested - a generic system design framework is not enough.
RecSys is one of the most production-oriented ML areas, and the interview there evaluates not formula knowledge but production-problem experience. Generic system design will show a middle level; knowing RecSys-specifics (negative sampling, online learning loops, multi-objective, Goodhart's Law) shows a staff+ level. This differentiation is critical for compensation: a middle ML engineer at Meta makes USD 250K total; staff makes USD 500K+ - mostly because of the ability to see production problems that are not in textbooks.
At a staff interview you are asked about negative sampling for training. Which answer shows production experience?
Related topics
RecSys interviews intersect several disciplines:
- RecSys at scale — Netflix, YouTube, TikTok systems - canonical references for FAANG interviews
- Collaborative filtering basics — The base ML foundation - without it the design answers are unclear
- ML general interview — Same pattern: formula, tradeoff, production reality
- Bayesian methods — A/B test stats and CTR estimation built on Bayesian models
Key ideas
- **Design feed**: two-stage (candidate generation via ANN + ranking via DNN); latency budget defines the architecture; multi-objective with weights for CTR/watch_time/satisfaction
- **Design search**: hybrid retrieval (BM25 + dense), query understanding (normalise/intent/rewrite/personalize), multimodal (CLIP) for image search
- **Metrics**: multi-objective optimization, offline (NDCG, MAP) vs online (A/B); honest discussion of Goodhart's Law and the proxy-target gap
- **Production realities**: negative sampling (mixed: in-batch + hard + uniform), cold start (demographic + content-based), model serving (distill + GPU + cache), failure modes (FAISS down, ranking timeout)
Questions for reflection
- YouTube metric evolution 2012 → 2016 → 2019 (CTR → watch_time → satisfaction) - which class of mistakes repeats, and which tools could have warned earlier?
- ByteDance Monolith vs Meta DLRM - architecturally different approaches to large-scale, embedding-heavy recommendation. Which operational tradeoffs sit behind each, and which teams does each fit?
- If a staff ML engineer earns USD 500K+ and a PhD student in RecSys research gets a USD 80K stipend, which specific production skills create this gap - and how do you build them outside FAANG?
Related lessons
- rec-13 — Netflix/YouTube/TikTok architectures are the base for most design questions
- rec-01 — CF/content-based basics form the foundation for the architecture
- ml-13-svm — ML interview pattern - same approach: formulas + tradeoffs + production realities
- ds-04-consistent-hashing — Sharding user embeddings is a direct application of distributed systems
- rt-13 — Realtime backend design - a parallel system-design framework
- prob-04-bayes — Metrics like CTR estimation and A/B tests on Bayesian models
- ml-01-intro