Recommender Systems

Candidate Generation

Цели урока

Understand how HNSW works and its trade-offs
Design a two-stage retrieval pipeline with multiple candidate sources
Apply a Bloom Filter for efficient deduplication of seen items
Compare LSH and HNSW on key characteristics

In 2016 YouTube published its paper on the recommendation system. The core engineering problem: from 800 million videos, return a personalized list of 10 within 200 ms. The solution - a two-stage architecture with ANN Search for retrieval - became the industry standard. Netflix, Spotify, and TikTok all use the same pattern.

**YouTube** - two-stage with matrix factorization for retrieval, DNN for ranking
**Spotify** - HNSW for ANN search across 80M tracks, Bloom Filter for viewed exclusion
**Pinterest** - LSH for pin recommendations against a continuously updated catalog

Предварительные знания

Embeddings and vector similarity
Matrix factorization basics
Big-O reasoning about latency and throughput

Two-Tower Retrieval and the YouTube Deep Learning Era

In 2016 Paul Covington, Jay Adams, and Emre Sargin of YouTube published "Deep Neural Networks for YouTube Recommendations", which split the problem into two stages: a deep candidate-generation network that narrows millions of videos down to hundreds, then a separate ranking network. The candidate model framed retrieval as extreme multiclass classification and trained it with sampled softmax, approximating a softmax over the whole catalog from a small set of negatives. At serving time the learned user vector queries a precomputed index of video vectors through approximate nearest neighbor search. This two-tower, two-stage design became the default retrieval pattern across the industry, because exact scoring of an entire catalog per request is impossible at web scale.

ANN Search: HNSW and ScaNN

Spotify's catalog holds 80 million tracks, each represented as a 128-dimensional vector. Finding the 1,000 nearest tracks to a user profile via brute-force means 80 million dot products per request. At 10,000 requests per second that is computationally infeasible. **ANN (Approximate Nearest Neighbor) Search** is a family of algorithms that return sufficiently close vectors in milliseconds, trading a small fraction of recall for orders-of-magnitude speedup.

**HNSW (Hierarchical Navigable Small World)** builds a multi-layer graph: upper layers are sparse "highways" for fast traversal; lower layers are dense neighbor maps. Search starts at the top and descends, approaching the target at each level. **ScaNN (Scalable Nearest Neighbors)** from Google uses product quantization and asymmetric distance scoring to achieve 10-20x speedup via SIMD CPU instructions.

HNSW stores the entire graph in RAM - 80M vectors at dim=128 requires ~40 GB. In production, the index is sharded by `userId % N`, with each shard on a separate host. Milvus and Weaviate implement this out of the box.

What is the main trade-off of ANN Search compared to exact (brute-force) search?

Two-Stage Architecture

A sophisticated ranking model (200+ features, 50 ms inference) cannot run against 80 million tracks. But ANN alone lacks contextual signals. The industry solution is a **two-stage architecture**: the first stage retrieves hundreds of candidates quickly; the second stage ranks only those candidates precisely.

Why does a two-stage recommender use multiple candidate sources (ANN + item2item + trending) instead of one?

Bloom Filter for deduplication

A user has already watched 50,000 videos. Every recommendation request must exclude them from candidates. Keeping a Set of 50,000 IDs in memory per request is expensive, and checking each candidate against Redis is too slow. A **Bloom Filter** is a probabilistic data structure: it uses ~100x less memory and answers "possibly seen" or "definitely not seen" in O(1).

A Bloom Filter never produces false negatives: if it says "not seen", that item is guaranteed to be new. False positives ("possibly seen") are acceptable: occasionally a new item is suppressed. For recommendations, that is preferable to repeatedly showing already-watched content.

A Bloom Filter reports that a user "possibly saw" video X. This means:

Locality-Sensitive Hashing

HNSW stores its entire graph in RAM - hundreds of gigabytes for a billion vectors. **Locality-Sensitive Hashing (LSH)** offers an alternative: instead of a graph, it uses hash tables where similar vectors land in the same bucket. The method is based on random projections: if two vectors are close in the original space, their projections onto a random hyperplane have the same sign with high probability.

LSH places similar vectors into the same bucket via random projections. What property makes this work?