Recommender Systems

Recommendation Serving

Netflix generates recommendations for 230 million users. The full ML pipeline takes 50ms. Multiplied by 230M - physically impossible in real time. Serving architecture is the art of achieving personalization in 50ms while simultaneously having batch data, aging caches, and streaming events.

  • **Netflix:** Precomputes recommendations via batch jobs every few hours, stores them in Cassandra, updates event-driven on meaningful signals. Feature store holds 3000+ features. P99 latency of the home page: under 100ms.
  • **TikTok:** Kappa architecture on Flink processes 10+ million events per second. Recommendations update within seconds after each watch. Feature store on ByteKV (proprietary) serves millions of requests/sec.
  • **Uber:** Feature Store Michelangelo holds precomputed features for dozens of ML models (ETA, surge pricing, route recommendations). A single pipeline eliminates training-serving skew across the entire company's ML stack.

Предварительные знания

  • Two-stage retrieval and ranking
  • Embeddings and nearest-neighbor search
  • Latency, caching, and basic systems design
  • Candidate Generation
  • Online Learning and A/B Testing

FAISS, ScaNN, and Low-Latency Serving

Serving recommendations at scale means finding the nearest item vectors to a user vector among millions of candidates within a few milliseconds, which rules out brute-force comparison. In 2017 Jeff Johnson, Matthijs Douze, and Hervé Jégou at Facebook AI Research released FAISS, a library for billion-scale similarity search that pushed product quantization and inverted indexes onto the GPU and became the default tool for vector retrieval. In 2020 a Google team published ScaNN, which combines an anisotropic quantization loss tuned for inner-product search with fast SIMD scoring, beating earlier methods on standard benchmarks. These libraries are what let the candidate index built during training answer real-time queries under a tight latency budget, the gap between an offline model and a production recommender.

Latency budget: 100ms or the user is lost

Amazon calculated: every 100ms of additional latency equals a 1% revenue loss. Google: 500ms of delay equals 20% fewer search queries. For recommender systems latency is especially critical: a user opens Netflix's home page and waits. If recommendations haven't arrived within 100ms, the page shows static content or trending items. The entire complex ML stack (candidate generation, ranking, personalization) must fit within the latency budget. A typical budget: 50-100ms P99.

Latency budget decomposition for a Netflix-like system: candidate generation (ANN search in embedding space) - 10ms, feature lookup (feature store) - 5ms, ranking model (inference) - 20ms, business rules (filtering, deduplication) - 5ms, network latency - 10ms. Total: 50ms. At P99 each stage must fit its budget at the 99th percentile, not the average.

Why is P99 used for recommender system latency requirements rather than the average?

Caching: when personalization waits

Fully personalized real-time recommendations are expensive. For millions of users this is impossible without caching. The multi-layer cache strategy: L1 - precomputed recommendations (batch job every hour, stored in Redis), L2 - segment-based recommendations (users in a cluster receive the same list), L3 - trending (fallback on cache miss). Key insight: 80% of users are satisfied with 'sufficiently personalized' recommendations from a cache updated an hour ago.

Cache invalidation is classically hard. Invalidation triggers: user watched new content (taste signal), new content added to catalog (needs to appear in recommendations), model retrained (all caches are stale). Strategy: TTL-based invalidation with event-driven updates for high-value signals. A Kafka event 'user finished watching' triggers a consumer that invalidates that user's cache.

What is the primary goal of multi-layer caching in recommender systems?

Feature Store: bridge between ML and production

Feature Store solves one of the main problems of ML in production: training-serving skew. A model is trained on features computed offline. In production those same features must be retrieved in milliseconds. Without a feature store teams build two different pipelines: one for training (Spark, batch), another for serving (Python, real-time). They diverge: different null handling, different data types, different aggregations. A feature store centralizes computation and storage: one source of truth for features both at training and serving time.

Feature store architecture: offline store (Parquet/Delta Lake) for training data, online store (Redis/DynamoDB) for low-latency serving. A feature pipeline computes values and synchronizes both stores. Examples: Feast (open-source), Tecton, Hopsworks. The Netflix Feature Store holds over 3000 features for recommender systems. The Uber Michelangelo Feature Store serves millions of feature requests per second.

What is training-serving skew, and how does a Feature Store address it?

Real-time recommendations: event-driven updates

The best moment for a recommendation is immediately after a meaningful signal. A user just watched a science documentary - the perfect time to show similar content. An hour later the context is gone. Real-time recommender systems process events (watch, click, search) and update recommendations within seconds, not hours. Architecture: Kafka event -> Flink/Spark Streaming processing -> feature store update -> cache invalidation -> next request gets fresh recommendations.

Lambda architecture for recommendations: batch layer (model retraining once per day/week), speed layer (real-time contextual feature updates), serving layer (combines batch model with real-time features). Kappa architecture simplifies: one streaming pipeline that processes both historical and new events. TikTok moved to Kappa for recommendations: a single Flink pipeline processes 10+ million events per second.

Real-time recommendations require training the model in real time on every new event

Real-time recommendations typically mean real-time feature updates while using an offline-trained model. Full online model learning is the rare exception

Retraining a large model (embedding + ranking) takes hours. Real-time value is achieved through fast updates of contextual features (last watch, current search) while keeping a stable batch-trained model. Bandits (from rec-11) are the exception: they update in real time, but they are lightweight models, not deep learning.

What is the difference between Lambda and Kappa architectures for real-time recommendations?

Key ideas

  • **Latency budget** is distributed across pipeline stages (ANN + feature lookup + inference + rules). P99 latency is the working metric - the average hides the tail distribution where real users with poor experience reside.
  • **Multi-layer cache** (personalized / segment / trending) serves 80%+ of requests without ML inference. Event-driven invalidation provides freshness on meaningful signals.
  • **Feature Store** eliminates training-serving skew: one source of features for both training and serving. Lambda vs Kappa - trade-off between operational complexity and pipeline unification.

Related topics

Serving infrastructure ties together all layers of the recommender system:

  • Online Learning and A/B Testing — Feature store provides real-time features for bandit algorithms, and interleaving tests require serving infrastructure to mix candidates
  • Approximate Nearest Neighbors — ANN search is the most latency-sensitive stage of candidate generation. The algorithm choice (FAISS, HNSW, ScaNN) determines the first 10-20ms of the latency budget

Вопросы для размышления

  • Feature store synchronizes offline and online stores. What happens in a recommender system during a temporary online store outage (Redis outage), and how should graceful degradation be designed?
  • Kappa architecture handles historical retraining through Kafka topic replay with long retention. What problems arise when replaying 1 terabyte of historical events, and how are they solved?
  • Multi-layer cache serves segment recommendations on a personalized cache miss. How should the optimal segment size be determined: too small leads to poor cache hit ratio, too large to weak personalization?

Связанные уроки

  • rec-09 — Serving runs the two-stage retrieval pipeline
  • rec-11 — Online learning policies are deployed at serving
  • rec-13 — Serving patterns scale to industrial systems
  • aie-28-caching-optimization — Same caching tactics as LLM serving
  • ds-20-lru-cache — LRU caches hot recommendations near the user
  • dist-12-consistency
Recommendation Serving

0

1

Sign In