Recommender Systems

Deep Learning Recommendations

YouTube serves recommendations to 2 billion users from a catalog of billions of videos. The retrieval step (Two-Tower) narrows that to a few hundred candidates in milliseconds; the ranker (DeepFM/AutoInt) picks the final list. NCF, DeepFM, AutoInt, and Two-Tower are the building blocks of the retrieve-then-rank architecture behind feeds such as those at YouTube, TikTok, Netflix, and Spotify.

**YouTube** uses a Two-Tower retrieval model to pull a few hundred candidates from a billion-video corpus, then ranks them with a deep network combining user history, freshness, and content features
**TikTok** runs a multi-stage funnel: Two-Tower retrieval, then DeepFM-style ranking with hundreds of dense and sparse features, optimizing watch time and engagement signals
**Pinterest** data was one of the two benchmarks in the original NCF paper; in production, Pinterest uses graph-augmented embeddings on top of the Two-Tower pattern for related-pin retrieval

Prerequisites

Matrix factorization and the dot product between user and item embeddings
Multilayer perceptrons and backpropagation
Embeddings as dense representations of categorical features

Matrix Factorization

From GroupLens to Neural Collaborative Filtering

In 1994, Paul Resnick and colleagues at MIT and the University of Minnesota published GroupLens, one of the first automated collaborative filtering systems, applied to Usenet news. Users rated articles; the system predicted ratings by finding users with similar taste. The Netflix Prize (2006-2009) sharply accelerated the field: a $1M prize for improving the RMSE of Netflix's Cinematch baseline by 10% attracted over 51,000 contestants and popularized SVD++, Restricted Boltzmann Machines for collaborative filtering, and large ensembles of matrix factorization models. The winning BellKor's Pragmatic Chaos solution combined over 100 models. Neural Collaborative Filtering (Xiangnan He et al., 2017) reported that neural networks can outperform MF by modeling non-linear interactions, and it belongs to the same wave of deep models as DeepFM, AutoInt, and the Two-Tower architecture now powering YouTube and TikTok.

Neural Collaborative Filtering

**NCF (Neural Collaborative Filtering)** replaces the dot product of matrix factorization with a neural network that can model non-linear user-item interactions. The flagship architecture, **NeuMF**, runs two parallel paths: a Generalized Matrix Factorization (GMF) path that preserves the linear signal, and an MLP path that captures complex patterns. Their outputs are concatenated and fed to a final prediction layer.
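The two-path structure can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the paper's implementation: the sizes (`n_users`, `n_items`, `dim`, `hidden`), the single MLP layer, and the random weights are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim, hidden = 5, 7, 4, 8  # illustrative sizes (assumed)

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

# Separate embedding tables for the GMF path and the MLP path.
P_gmf, Q_gmf = init(n_users, dim), init(n_items, dim)
P_mlp, Q_mlp = init(n_users, dim), init(n_items, dim)
W1, b1 = init(2 * dim, hidden), np.zeros(hidden)  # one MLP layer (assumed depth)
w_out = init(dim + hidden)                        # final prediction layer

def neumf_score(user, item):
    """Return a predicted interaction probability in (0, 1)."""
    # GMF path: the element-wise product keeps the linear MF signal.
    gmf = P_gmf[user] * Q_gmf[item]
    # MLP path: concatenate the embeddings and apply a ReLU layer.
    mlp = np.maximum(0.0, np.concatenate([P_mlp[user], Q_mlp[item]]) @ W1 + b1)
    # Fuse both paths, then squash the logit with a sigmoid.
    logit = np.concatenate([gmf, mlp]) @ w_out
    return float(1.0 / (1.0 + np.exp(-logit)))

score = neumf_score(user=2, item=3)
```

If `w_out` were fixed to ones over the GMF slots and zeros over the MLP slots, the logit would reduce to the plain dot product of matrix factorization, which is the sense in which the GMF path preserves the linear signal.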

**Negative sampling is critical for implicit feedback**: most recommendation datasets have only positive signals (clicks, purchases). For every positive interaction, NCF samples K random unobserved items as negatives during training. The ratio K=4 (4 negatives per positive) is a common default taken from the original paper. Without negative sampling, the model sees only positive labels and collapses to predicting positive for everything.
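A minimal sketch of this sampling step, using only the standard library. The function name `sample_negatives`, the `user_items` layout, and the toy data are hypothetical; the sketch also assumes every user has left at least one item unobserved, otherwise the rejection loop would never terminate.

```python
import random

def sample_negatives(user_items, n_items, k=4, seed=0):
    """Return (user, item, label) rows: each positive plus k sampled negatives.

    user_items: dict mapping user id -> set of items the user interacted with.
    """
    rng = random.Random(seed)
    rows = []
    for user, positives in user_items.items():
        for pos in sorted(positives):
            rows.append((user, pos, 1))
            drawn = 0
            while drawn < k:
                cand = rng.randrange(n_items)
                if cand not in positives:  # reject items the user has seen
                    rows.append((user, cand, 0))
                    drawn += 1
    return rows

interactions = {0: {1, 5}, 1: {2}}  # toy data (assumed)
rows = sample_negatives(interactions, n_items=20, k=4)
```

In practice the negatives are usually redrawn every epoch so the model sees different unobserved items over time.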

Why does NCF outperform classic matrix factorization for recommendation tasks?

DeepFM: Factorization Machines + Deep Network

**DeepFM** combines Factorization Machines (for second-order feature interactions) with a Deep Neural Network (for higher-order interactions), sharing embeddings between both components. The key insight: FM explicitly models pairwise interactions between all feature fields in O(kn) time (k is the embedding dimension, n the number of features), while the DNN learns arbitrary higher-order combinations. Because the embeddings are shared, each feature's representation is trained jointly by both components.
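The shared-embedding idea can be sketched as follows. This is an illustrative forward pass under simplifying assumptions: one active (one-hot) feature per field, a single hidden layer, no bias terms, and made-up sizes and weights. The pairwise term is written as an explicit loop over pairs for readability.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n_features, n_fields, k, hidden = 50, 3, 4, 8  # illustrative sizes (assumed)

V = rng.normal(scale=0.1, size=(n_features, k))  # shared embedding table
w = rng.normal(scale=0.1, size=n_features)       # first-order FM weights
W1 = rng.normal(scale=0.1, size=(n_fields * k, hidden))
w2 = rng.normal(scale=0.1, size=hidden)

def deepfm_logit(active):
    """Score one example; `active` holds one active feature index per field."""
    E = V[list(active)]  # (n_fields, k) embeddings, read by BOTH components
    # FM component: first-order weights plus all pairwise dot products.
    fm = w[list(active)].sum() + sum(E[i] @ E[j]
                                     for i, j in combinations(range(len(E)), 2))
    # Deep component: the same embeddings, flattened, through a small MLP.
    deep = np.maximum(0.0, E.reshape(-1) @ W1) @ w2
    return float(fm + deep)

logit = deepfm_logit([3, 17, 42])
```

Both components read the same rows of `V`, so a gradient from either one would update the same embedding.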

**FM second-order term in O(kn) time**: computing all pairwise interactions naively costs O(kn^2). FM uses the identity: sum_i sum_{j>i} (v_i . v_j) x_i x_j = 0.5 * (||sum_i v_i x_i||^2 - sum_i ||v_i||^2 x_i^2), where v_i is the k-dimensional embedding of feature i and x_i is its value, reducing the cost to O(kn). With sparse inputs only the non-zero x_i contribute, which is why FM scales to millions of features: each pairwise interaction is computed implicitly through embedding dot products.
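The identity is easy to check numerically. The sketch below compares the naive double loop with the reformulated sum on random data; the sizes `n` and `k` and the random inputs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                  # n features, k-dimensional factors (assumed)
V = rng.normal(size=(n, k))  # row i is the factor vector v_i
x = rng.normal(size=n)       # feature values x_i

# Naive version: visit every pair i < j explicitly.
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Reformulated version: square of the sum minus the sum of squares, halved.
Vx = V * x[:, None]          # row i is v_i * x_i
fast = 0.5 * ((Vx.sum(axis=0) ** 2).sum() - (Vx ** 2).sum())

assert np.isclose(naive, fast)
```

The second form touches each feature once per embedding dimension, which is where the O(kn) cost comes from.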

What architectural decision distinguishes DeepFM from Wide&Deep (Google, 2016)?

AutoInt: Attention for feature interactions

**AutoInt** applies multi-head self-attention to feature embeddings, allowing each feature field to attend to all other fields with learned, adaptive weights. Unlike FM (which scores every feature pair with the same fixed dot product, regardless of the other features in the instance) or DNN (which treats all features as a flat vector), AutoInt learns which feature combinations are important for each specific instance. The resulting attention weights can be inspected as interaction strengths.

**AutoInt interpretability**: the attention weights in each head reveal which feature pairs the model considers important for a prediction. For click-through rate prediction, heads may specialize - one head might learn user-age x item-category interactions, another might focus on time-of-day x device-type. This interpretability is a practical advantage over black-box MLPs.
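A rough sketch of one self-attention layer over field embeddings, returning the attention weights alongside the new representations. The sizes, the random weights, the residual projection `W_res`, and the unscaled softmax are assumptions made for this example and are not tuned or taken from a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, d, n_heads, d_head = 4, 8, 2, 4  # illustrative sizes (assumed)

E = rng.normal(size=(n_fields, d))         # one embedding per feature field
Wq = rng.normal(scale=0.3, size=(n_heads, d, d_head))
Wk = rng.normal(scale=0.3, size=(n_heads, d, d_head))
Wv = rng.normal(scale=0.3, size=(n_heads, d, d_head))
W_res = rng.normal(scale=0.3, size=(d, n_heads * d_head))  # residual projection

def interacting_layer(E):
    """Return (new field representations, attention weights per head)."""
    Q = np.einsum("fd,hde->hfe", E, Wq)           # (heads, fields, d_head)
    K = np.einsum("fd,hde->hfe", E, Wk)
    Vh = np.einsum("fd,hde->hfe", E, Wv)
    scores = Q @ K.transpose(0, 2, 1)             # (heads, fields, fields)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over attended fields
    heads = attn @ Vh                             # (heads, fields, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_fields, -1)
    out = np.maximum(0.0, merged + E @ W_res)     # residual connection + ReLU
    return out, attn

out, attn = interacting_layer(E)
# attn[h, i, j]: how strongly field i attends to field j in head h.
```

Inspecting `attn[h]` for a given example is what the interpretability claim amounts to: each row is a distribution over the fields that field i combines with in head h.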

What does AutoInt model through self-attention that FM and simple MLP cannot?

Two-Tower: retrieval architecture

The **Two-Tower** model encodes users and items into a shared embedding space using separate neural networks (towers). Relevance is the dot product (or cosine similarity) of the two embeddings. The critical property: item embeddings can be **precomputed offline** and indexed with an Approximate Nearest Neighbor (ANN) search structure. At serving time, only the user tower runs - enabling retrieval from billions of items in milliseconds.
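To make the separation concrete, here is a small sketch in which each tower is reduced to a single linear layer with a tanh and L2 normalization. The tower definitions, feature sizes, and random inputs are stand-ins (assumptions); real towers are deeper networks over learned feature embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_user, d_item, d_emb = 6, 5, 4            # illustrative sizes (assumed)
W_user = rng.normal(size=(d_user, d_emb))  # stand-in for the user tower
W_item = rng.normal(size=(d_item, d_emb))  # stand-in for the item tower

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def user_tower(features):
    return l2_normalize(np.tanh(features @ W_user))

def item_tower(features):
    return l2_normalize(np.tanh(features @ W_item))

# Offline: item embeddings depend only on item features, so compute them once.
item_features = rng.normal(size=(1000, d_item))
item_index = item_tower(item_features)     # (1000, d_emb), cached

# Online: run only the user tower, then score with one matrix-vector product.
user_emb = user_tower(rng.normal(size=d_user))
scores = item_index @ user_emb             # cosine similarity per item
```

Because no layer ever sees user and item features together, `item_index` stays valid for every user, which is the property that makes precomputation possible.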

**Two-Tower pipeline in production**: (1) Offline: train Two-Tower, export item embeddings, index with FAISS/ScaNN/Milvus. (2) Online: compute user embedding in real time (user tower inference ~1ms), query ANN index for top-K items (~5ms), pass candidates to a ranking model (DeepFM or AutoInt). The retrieval step reduces billions of items to hundreds; the ranker applies expensive models to just those hundreds.
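The two stages can be sketched end to end. In this illustration an exact brute-force top-K search stands in for the ANN index (FAISS, ScaNN, and Milvus approximate the same operation at much larger scale), and `rank` is a hypothetical placeholder for a DeepFM- or AutoInt-style ranker. All sizes and embeddings are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, d_emb, top_k, final_n = 10_000, 16, 200, 10  # assumed sizes

# (1) Offline: item embeddings exported from the item tower (random stand-ins).
item_index = rng.normal(size=(n_items, d_emb))

def retrieve(user_emb, k):
    """Exact top-k by dot product; an ANN index would approximate this step."""
    scores = item_index @ user_emb
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k, linear time
    return top[np.argsort(-scores[top])]       # sort only the k survivors

def rank(user_emb, candidates):
    """Hypothetical ranker: any expensive model scoring (user, item) pairs."""
    item_vecs = item_index[candidates]
    return np.tanh(item_vecs @ user_emb) + 0.1 * np.abs(item_vecs).mean(axis=1)

# (2) Online: embed the user, retrieve candidates, rank only those candidates.
user_emb = rng.normal(size=d_emb)              # stand-in for user tower output
candidates = retrieve(user_emb, top_k)
ranked = candidates[np.argsort(-rank(user_emb, candidates))][:final_n]
```

The ranker runs on `top_k` items instead of `n_items`, so its per-item cost can be far higher than the retrieval score without affecting overall latency much.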

Why is Two-Tower suitable for retrieval but not for ranking in recommendation systems?

Key Takeaways

**NCF** replaces the dot-product of matrix factorization with a neural net (NeuMF combines a GMF path with an MLP path), capturing non-linear user-item interactions
**DeepFM** shares embeddings between a Factorization Machine (second-order interactions in O(kn)) and a DNN (higher-order interactions), avoiding the manual cross-feature engineering of Wide&Deep
**AutoInt** applies multi-head self-attention to feature fields, learning per-instance interaction weights that are interpretable through attention scores
**Two-Tower** separates user and item encoders so item embeddings can be precomputed and served via ANN (FAISS, ScaNN, Milvus). Retrieval narrows billions of items to hundreds, then a ranker scores those candidates

Questions for reflection

How does Neural Collaborative Filtering apply to real production systems?
What are the tradeoffs when choosing between the approaches covered in this lesson?
How would adopting a Two-Tower retrieval architecture change system design decisions?

Related lessons

rec-03 — Matrix factorization and collaborative filtering are the baselines that deep learning models extend
cv-04 — Both recommendation and vision models follow the same evolution: simple linear models to deep architectures with attention
ds-04-consistent-hashing — Two-Tower embeddings are served via ANN (Approximate Nearest Neighbor) search across distributed shards using consistent hashing
cloud-04 — Two-Tower item embeddings are precomputed and served from GPU instances; EC2 spot instances are used for batch embedding generation
rec-05 — Multi-task learning and real-time feature serving are the next step after mastering these architectures
ml-51-recommendation-systems
dl-03
aie-09-embeddings
ml-01