Recommender Systems
Recommendations drive about 70% of watch time on YouTube and about 80% of viewing on Netflix. Spotify's Discover Weekly accounts for 30% of streams. Recommender systems are not a feature - they are the business model. And behind each of those numbers sits a specific algorithm, a specific architecture, and years of A/B tests.
- Spotify Discover Weekly generates a personalized 30-song playlist every Monday via hybrid CF + content features. Launched in 2015, it reached 40M users in the first year
- Amazon Product Recommendations drive 35% of total revenue - 'Customers who bought X also bought Y' through item-based CF remains one of the most profitable algorithms in e-commerce history
- TikTok For You Page uses a Two-Tower model with reinforcement learning - optimizing not just relevance but watch time and probability of a like
Collaborative Filtering: wisdom of the crowd
Netflix Prize, 2006: one million dollars for a 10% improvement in RMSE over Netflix's own system. In 2009 the team BellKor's Pragmatic Chaos won with a large blended ensemble; the often-quoted figure of 107 models comes from BellKor's 2007 Progress Prize entry. The core insight: there is no need to know anything about a movie. Knowing what users with similar ratings chose next is enough.
Collaborative Filtering has two paradigms. Memory-based: compute user or item similarity via cosine or Pearson correlation. Model-based: Matrix Factorization, SVD, ALS. User-based CF: 'users similar to the target rated X highly'. Item-based CF: 'users who liked A also rate B highly'.
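The memory-based, item-based variant can be sketched in a few lines. This is a minimal illustration on made-up ratings, not a production recipe: the user names, item ids, and the helper names `item_vectors`, `cosine`, and `recommend_item_based` are all assumptions introduced here.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. All names and values are made up.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5},
    "cat": {"A": 1, "C": 5, "D": 4},
    "dan": {"B": 4, "C": 2, "D": 1},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, r in user_ratings.items():
            items.setdefault(item, {})[user] = r
    return items

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_item_based(user, ratings, top_n=2):
    """Score unseen items by a similarity-weighted average of the user's own ratings."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, cand_vec in items.items():
        if candidate in seen:
            continue
        sims = [(cosine(cand_vec, items[i]), r) for i, r in seen.items()]
        total = sum(s for s, _ in sims)
        if total > 0:
            scores[candidate] = sum(s * r for s, r in sims) / total
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend_item_based("bob", ratings))
```

User-based CF follows the same pattern with the roles swapped: compare the rows of users instead of the columns of items.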
Matrix Factorization (SVD, ALS) usually outperforms memory-based methods on sparse rating data. The rating matrix R (users x items) is approximated as R ≈ U x V^T, where U contains latent user factors and V latent item factors. Each user and item is represented as a k-dimensional vector, and a predicted rating is the dot product of the two. ALS (Alternating Least Squares) fixes V and solves for U, then fixes U and solves for V, repeating until convergence. Spark MLlib implements distributed ALS for millions of users.
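To make the alternating step concrete, here is a small single-machine sketch of ALS with NumPy. It assumes explicit ratings where 0 means "not rated" and adds L2 regularization; the function name `als`, the toy matrix, and the hyperparameter values are illustrative choices, not Spark MLlib's implementation.

```python
import numpy as np

def als(R, mask, k=2, reg=0.1, iters=20, seed=0):
    """Factorize R ~= U @ V.T using only the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    ridge = reg * np.eye(k)
    for _ in range(iters):
        # Fix V and solve a small ridge regression for every user.
        for u in range(n_users):
            Vu = V[mask[u]]                      # factors of items rated by user u
            U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, mask[u]])
        # Fix U and solve the symmetric problem for every item.
        for i in range(n_items):
            Ui = U[mask[:, i]]                   # factors of users who rated item i
            V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[mask[:, i], i])
    return U, V

# Toy rating matrix; 0 marks a missing rating (an assumption of this sketch).
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U, V = als(R, mask)
pred = U @ V.T                                   # dense matrix of predicted ratings
rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print("train RMSE:", round(float(rmse), 3))
```

Each inner solve is an independent k x k linear system, which is why the algorithm distributes well: the per-user and per-item updates can run in parallel.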
Cold start is the main weakness of Collaborative Filtering. A new user has no interaction history - nothing to compute similarity from. Solutions: popularity-based fallback (recommend trending items), onboarding survey, or a hybrid with content-based. Spotify uses an onboarding step (choose 5 artists) for every new user.
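The popularity-based fallback is the simplest of these options to sketch. The interaction log, the `recommend` wrapper, and the `fake_cf_model` placeholder below are hypothetical and only show the branching logic.

```python
from collections import Counter

# Hypothetical interaction log: user -> list of item ids.
history = {
    "u1": ["song_a", "song_b"],
    "u2": ["song_a", "song_c"],
    "u3": ["song_a", "song_b", "song_d"],
}

def popular_items(history, top_n):
    """Most frequently consumed items across all users."""
    counts = Counter(item for items in history.values() for item in items)
    return [item for item, _ in counts.most_common(top_n)]

def recommend(user, history, personalized, top_n=2):
    """Use the personalized model when history exists, else fall back to popularity."""
    if history.get(user):
        return personalized(user)[:top_n]
    return popular_items(history, top_n)

def fake_cf_model(user):
    # Placeholder for a real CF model; returns a fixed list for illustration.
    return ["song_d", "song_c", "song_b"]

print(recommend("u1", history, fake_cf_model))        # known user: personalized path
print(recommend("new_user", history, fake_cf_model))  # cold start: popularity path
```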
What is the main problem Collaborative Filtering has with new users?
Content-Based Filtering: the DNA of a product
Content-Based Filtering recommends items similar in characteristics to items the user has already rated. There is no cold start problem for items: a new film has a genre, director, and cast. There is no dependency on other users. The downside: the system cannot venture outside known preferences - a filter bubble forms.
Content representation: text descriptions turn into TF-IDF or dense embeddings. Categorical features become one-hot encodings. Numerical features are normalized. A user profile is built as a weighted sum of the vectors of items the user interacted with. Recommendation = cosine similarity between the profile and candidate items.
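The profile-and-cosine pipeline described above can be sketched as follows. The feature layout (three one-hot genres plus a normalized year), the film ids, and the function names are assumptions made for this example; a real system would use TF-IDF or learned embeddings for text.

```python
from math import sqrt

# Hypothetical item features: [action, comedy, drama, normalized_year]
items = {
    "film_a": [1, 0, 0, 0.9],
    "film_b": [1, 0, 1, 0.8],
    "film_c": [0, 1, 0, 0.2],
    "film_d": [0, 0, 1, 0.7],
}

def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profile(rated, items):
    """User profile = sum of item vectors weighted by the user's ratings."""
    dim = len(next(iter(items.values())))
    profile = [0.0] * dim
    for item, weight in rated.items():
        for j, x in enumerate(items[item]):
            profile[j] += weight * x
    return profile

def recommend_content(rated, items, top_n=2):
    """Rank unseen items by cosine similarity to the user profile."""
    profile = build_profile(rated, items)
    candidates = {i: cosine(profile, v) for i, v in items.items() if i not in rated}
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

# A user who rated one action film highly gets the other action film first.
print(recommend_content({"film_a": 5}, items))
```

Note that no other user's data appears anywhere in the computation, which is both the strength (no dependency on the crowd) and the source of the filter bubble.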
The Pandora Music Genome Project is the extreme content-based case: each song is described by 450 attributes (key, tempo, vocal style, instruments, structure), annotated manually by human music analysts. Annotation cost: millions of dollars. The result: recommendations with no item cold start and no popularity bias, and always explainable.
Why does Content-Based Filtering create a filter bubble?
Hybrid: the best of both worlds
Netflix, Spotify, YouTube - all use hybrid approaches. CF works well for users with history but poorly for new ones. Content-based works well for new items and explainability but creates filter bubbles. Hybrid combines both.
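The Two-Tower model is a widely used retrieval architecture at companies such as Airbnb, Pinterest, and YouTube. The user encoder maps user features to an embedding; the item encoder maps item features to an embedding in the same space. Their dot product is the relevance score. Both towers are trained jointly. Because the towers interact only through that dot product, item embeddings can be precomputed offline, so inference is one user-tower pass plus an approximate nearest neighbor (ANN) search that completes in milliseconds across millions of items.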
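The serving side of this design can be sketched with NumPy. Everything here is a stand-in: the towers are untrained single linear layers with random weights, the feature dimensions are arbitrary, and the exact top-k search takes the place of a real ANN index. The point is only the split between offline item embedding and online user embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
USER_DIM, ITEM_DIM, EMB_DIM, N_ITEMS = 6, 8, 4, 1000

# Stand-ins for trained towers: single linear layers with random weights.
W_user = rng.normal(size=(USER_DIM, EMB_DIM))
W_item = rng.normal(size=(ITEM_DIM, EMB_DIM))

def user_tower(x):
    """Map raw user features to the shared embedding space."""
    return x @ W_user

def item_tower(x):
    """Map raw item features to the shared embedding space."""
    return x @ W_item

# Offline: embed the whole catalog once and store the result.
item_features = rng.normal(size=(N_ITEMS, ITEM_DIM))
item_index = item_tower(item_features)          # shape (N_ITEMS, EMB_DIM)

def retrieve(user_features, k=5):
    """Online: one user-tower pass, then a top-k search by dot product."""
    u = user_tower(user_features)               # shape (EMB_DIM,)
    scores = item_index @ u                     # exact search; an ANN index would approximate this
    top = np.argpartition(-scores, k)[:k]       # unordered k best items
    top = top[np.argsort(-scores[top])]         # order the k winners by score
    return top, scores[top]

ids, scores = retrieve(rng.normal(size=USER_DIM))
print(ids, scores)
```

Only `retrieve` runs per request. The catalog embedding step is a batch job, which is what keeps online latency independent of how expensive the item tower is.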
Why does the Two-Tower model separate user and item encoders?
Metrics: precision@K, NDCG, diversity
RMSE - the classic Netflix Prize metric - turned out to be a poor proxy for real engagement. Low RMSE can coexist with poor recommendations. Ranking metrics are better: what matters is not rating prediction accuracy but the quality of ranking in the top-K.
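A short sketch shows how two ranking metrics from this section's title differ. It assumes graded relevance labels and the common log2 position discount for DCG; the function names and the toy rankings are illustrative. Precision@K only counts hits in the top-K, while NDCG@K also rewards putting the most relevant items first.

```python
from math import log2

def precision_at_k(ranked, relevant, k):
    """Share of the top-k positions filled with relevant items."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / log2(pos + 2) for pos, g in enumerate(gains[:k]))

def ndcg_at_k(ranked, relevance, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(item, 0) for item in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Graded relevance for one user (made-up values); item "x" is irrelevant.
relevance = {"a": 3, "b": 2, "c": 1}
good_order = ["a", "b", "x"]   # best item first
bad_order = ["x", "b", "a"]    # same items, best item last

for name, ranking in [("good", good_order), ("bad", bad_order)]:
    p = precision_at_k(ranking, set(relevance), k=3)
    n = ndcg_at_k(ranking, relevance, k=3)
    print(name, round(p, 3), round(n, 3))
```

Both rankings contain the same two relevant items, so Precision@3 cannot tell them apart; NDCG@3 is higher for the ordering that places the most relevant item at the top.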
A/B testing is the final arbiter. Offline metrics (NDCG) often fail to correlate with online metrics (CTR, retention). YouTube found that a model with better NDCG produced worse watch time. The reason: filter bubble - users saw predictable content and disengaged faster.
Serendipity vs accuracy trade-off: an accurate system recommends only what the user will almost certainly rate highly. Boring. A serendipitous system injects unexpected high-quality recommendations - long-term engagement is higher. Netflix deliberately includes long-tail recommendations, accepting a short-term NDCG drop for better long-term retention.
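One simple way to trade a little accuracy for exposure to the long tail is to reserve a few slots in the final list for less popular items. The sketch below is an illustration of that idea only, not the method Netflix or any other company uses; the function name, the popularity threshold, and the slot count are assumptions.

```python
def rerank_with_long_tail(scored, popularity, n=5, tail_slots=1, tail_threshold=100):
    """Keep the top (n - tail_slots) items by score, then fill the rest with long-tail items.

    scored: list of (item, score) pairs from the ranking model.
    popularity: item -> interaction count; below tail_threshold counts as long tail.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    head = [item for item, _ in ranked[: n - tail_slots]]
    tail = [item for item, _ in ranked[n - tail_slots:]
            if popularity.get(item, 0) < tail_threshold][:tail_slots]
    # If there are not enough long-tail candidates, fall back to the plain ranking.
    filler = [item for item, _ in ranked if item not in head and item not in tail]
    return (head + tail + filler)[:n]

# Made-up scores and popularity counts; only "f" is a long-tail item.
scored = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.5), ("f", 0.4)]
popularity = {"a": 1000, "b": 1000, "c": 1000, "d": 1000, "e": 1000, "f": 10}

print(rerank_with_long_tail(scored, popularity))
```

The long-tail item displaces a higher-scored popular one, so an offline ranking metric computed on this list can drop slightly. Whether long-term engagement improves is a question only an online experiment can answer.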
Misconception: a better offline NDCG guarantees better online CTR and retention.
Reality: offline metrics are necessary but not sufficient. Only an A/B test with real users reveals true quality.
Offline evaluation is built on historical interactions - a biased sample, because users could only interact with what the previous system showed them. Distribution shift between logged data and live traffic is common. YouTube and Netflix regularly find that the 'best' offline model performs worse in production because of filter bubble effects, overfitting to historical bias, or ignoring long-term engagement.
Why does NDCG outperform Precision@K for ranking evaluation?
Related Topics
Recommender systems combine NLP, ML, and search:
- NLP for Data Science — Text embeddings for content-based filtering
- RAG — Similar retrieval and ranking problem
Key Ideas
- Collaborative Filtering: wisdom of the crowd via matrix factorization. Cold start is the main weakness
- Content-Based: item DNA, explainable, no item cold start, but creates filter bubbles
- Hybrid: Two-Tower = user embedding + item embedding -> ANN search in production
- Metrics: NDCG beats Precision@K; offline vs online gap is real; A/B test is the final arbiter
Questions for Reflection
- How do you balance accuracy and diversity in recommendations without hurting engagement?
- When is item-based CF preferable to user-based, and why did Amazon choose item-based?
- How do you solve the cold start problem for a new user without an onboarding survey?
Related Lessons
- ds-15 — Text embeddings from NLP are the core of content-based recommendations
- dl-08 — Neural collaborative filtering through deep learning
- ds-07 — Matrix factorization relies on linear algebra
- gai-16 — RAG is retrieval plus ranking - the same problem as recommender systems