Recommender Systems
Collaborative Filtering
Netflix Prize 2006: one million dollars for a 10% improvement in recommendation quality. BellKor Pragmatic Chaos won. Netflix never deployed the algorithm - too expensive to integrate. Today: 80% of Netflix viewing comes from recommendations. Amazon generates 35% of revenue through the phrase 'Customers who bought this also bought...'. Behind that phrase is collaborative filtering - an algorithm that has no idea what a 'book' or a 'film' actually is. It knows only one thing: who rated what.
- **Amazon** (item-based CF, Linden et al. 2003) processes billions of purchases to suggest similar items - the precomputed item-item matrix is refreshed daily, generating 35% of revenue
- **Netflix** uses both user-based and item-based CF as components of a hybrid system combined with deep learning - 80% of viewing comes from recommendations
- **Spotify** Discover Weekly - user-based CF on implicit feedback (listening time), 30 tracks per week, MAU >400 million
- **Last.fm** built a scrobble-based CF: analyzes listening history of millions of users to generate music recommendations without any explicit ratings
- **YouTube** uses item-based CF as one component of its hybrid ranking system - over 500 hours of video uploaded every minute
Предварительные знания
Amazon and the Birth of Industrial Collaborative Filtering
In 2003, Greg Linden, Brent Smith, and Jeremy York of Amazon published "Amazon.com Recommendations: Item-to-Item Collaborative Filtering". They showed that a precomputed item-item similarity table enables O(1) recommendations - just a table lookup. This solved the scale problem: Amazon was serving millions of users, and user-based CF was too slow. The approach became the industry standard for a decade. US Patent 6266649B1 - 'item-to-item collaborative filtering' - still belongs to Amazon.
User-Based CF: Finding Similar Users
Netflix Prize 2006: one million dollars for a 10% improvement in recommendation quality. BellKor Pragmatic Chaos won. Netflix never used the algorithm - too expensive to deploy. Today: 80% of Netflix viewing comes from recommendations. The price of that 80% is not complex mathematics. It is one question: **who watched the same thing?**
The algorithm works in three steps: 1. **Compute similarity** between the target user and all others 2. **Select k nearest neighbors** (users with the highest similarity) 3. **Predict the rating** as a weighted average of the neighbors' ratings Prediction formula for user **u** on item **i**:
Why use **(r_vi - r_v_mean)** rather than simply **r_vi**? Because people use rating scales differently. A generous user gives 4-5 stars to everything. A strict one gives 2-3. If Bob rates a film 4 (with a personal mean of 4.5), that is **below his norm** - he did not particularly like it. Subtracting the mean **normalizes** the ratings: +0.5 means "better than usual", -1.0 means "worse than usual" for that user.
**Complexity of user-based CF:** each prediction requires computing similarity against **every** user - O(n) per prediction, where n is the number of users. For Netflix at 500M users this is prohibitive. In practice: (1) precompute the similarity matrix offline, (2) use approximate nearest neighbors (ANN), or (3) switch to item-based CF.
**The minimum co-rating threshold** is a frequent beginner mistake. If two users have rated only 1 common film with the same score, their Pearson correlation is 1.0 (perfect). But this is **statistically meaningless**. Always enforce a minimum number of co-rated items (typically 5-20) before trusting the similarity value.
User A rates all films 4-5 stars (mean 4.5). User B rates them 1-3 (mean 2.0). Both gave film X a rating of 5. What does normalization reveal?
Item-Based CF: Finding Similar Items
In 2003, Amazon engineers published a paper that changed the industry: **"Item-to-Item Collaborative Filtering"** (Linden et al.). Rather than searching for similar users, they proposed searching for **similar items**. If people who bought book A also bought book B, those books are similar. The famous phrase "Customers who bought this also bought..." is item-based CF. That algorithm added 35% to Amazon's annual revenue.
The key advantage is **stability**. User tastes drift - one may watch action films today and documentaries next year. But the relationship between items is stable: The Lord of the Rings and The Hobbit will always be similar. The item-item similarity matrix is computed once and updated infrequently. There are also typically **far fewer items than users**, so the matrix is more compact.
**Adjusted cosine vs plain cosine for item-based CF.** Plain cosine compares matrix columns directly. The problem: a user who gives everything 5 stars inflates similarity for every item pair. **Adjusted cosine** subtracts each user's mean before computing - this is the item-based analog of Pearson correlation. Research consistently shows adjusted cosine to be significantly more accurate.
**Precomputation is the key to scaling.** In production, the item-item similarity matrix is computed offline (a batch job running daily). At request time the system looks up: which items has this user rated -> for each, find k similar items -> rank. The online portion is O(m*k), where m is the number of user ratings and k is the neighborhood size. With 100 ratings and k=20, that is 2,000 lookups - milliseconds.
Why did Amazon choose item-based CF over user-based CF in 2003?
Similarity Metrics: Cosine, Pearson, Jaccard
The whole mechanism of collaborative filtering rests on one question: **how do we measure similarity?** Two users are two rating vectors. How "close" are they? The answer depends on the chosen metric, and this choice **critically affects** recommendation quality. Three core metrics - **cosine similarity**, **Pearson correlation**, **Jaccard index** - each occupy a distinct niche.
| Metric | Range | Accounts for scale | Best application |
|---|---|---|---|
| Cosine similarity | [-1, 1] | No - compares vector direction | Explicit ratings (1-5 stars), TF-IDF text vectors |
| Pearson correlation | [-1, 1] | Yes - subtracts the mean | Explicit ratings with heterogeneous user rating scales |
| Jaccard index | [0, 1] | Not applicable | Binary data: bought/not bought, clicked/not clicked, liked/disliked |
| Adjusted cosine | [-1, 1] | Yes - subtracts each user's mean | Item-based CF (the standard for item similarity) |
**Which metric to use when?** A practical rule: - **Explicit ratings** (1-5 stars): **Pearson** for user-based, **adjusted cosine** for item-based. Normalization removes the effect of generous and strict raters. - **Binary data** (bought / not bought): **Jaccard**. There is no "scale" - only sets. - **Implicit feedback** (watch time, clicks): **Cosine** on normalized data. Pearson performs poorly on non-negative data.
**Pearson is undefined when variance is zero.** If a user gave every film the same rating (e.g., all 5s), the mean-centered vector is [0, 0, 0, ...]. Pearson correlation produces division by zero. Handle this edge case explicitly - return 0 (no information about preferences).
**Production advice:** do not commit to one metric - A/B test all of them. Research shows that the "best" metric depends on the domain and data distribution. For MovieLens, Pearson typically wins; for Amazon (implicit), cosine usually does better. The only reliable method is experimentation on the actual production data.
An online store knows only whether a purchase was made (bought / not bought), without any ratings. Which similarity metric is most appropriate?
Choosing the Neighborhood Size: the k Parameter
Similarity to every user has been computed. Now the question is: **how many neighbors (k) to use for prediction?** This is a hyperparameter, and choosing it correctly is a balance between noise and signal. Too few neighbors make predictions unstable. Too many pull in distant users with irrelevant tastes.
**Weighted average** is the key mechanism. Not all neighbors are equal: the nearest neighbor (similarity = 0.95) should have more influence than a distant one (similarity = 0.3). Ratings are therefore weighted by similarity. This automatically reduces the impact of distant neighbors - even when k is large.
| Parameter | Small k (1-5) | Medium k (10-30) | Large k (50+) |
|---|---|---|---|
| Bias | Low | Medium | High |
| Variance | High | Medium | Low |
| Speed | Fast | Moderate | Slow |
| Sparse data | May find no neighbors | Balanced | Includes noisy neighbors |
| Dense data | Unstable | Optimal | Excessive |
**Threshold-based neighborhood** is an alternative to a fixed k. Instead of "take the k nearest", take "everyone with similarity > threshold" (e.g., > 0.5). This is adaptive: a popular user may have 100 neighbors, a niche user just 3. A combined approach: threshold > 0.3 AND k <= 50.
**A common mistake: evaluating k on the same data used for training.** This will underestimate error and inflate the apparent optimal k. Always use a held-out set or cross-validation. On MovieLens-1M, the optimal k is typically 20-50 for user-based and 10-30 for item-based CF.
The more neighbors (k), the better the prediction - more data means more information
There is an optimal k that depends on the data. A k that is too large adds noise from irrelevant 'neighbors', diluting personalization. The optimum is typically k=20-50 for user-based and k=10-30 for item-based CF.
This is a manifestation of the bias-variance tradeoff. With a small k - high variance (noisy predictions from individual neighbors). With a large k - high bias (predictions converge to the mean because distant 'neighbors' contribute irrelevant ratings). The optimal k minimizes total error and is found via cross-validation on the specific dataset.
With k=100 in user-based CF, predictions become less personalized and converge toward the global mean rating. Why?
Key Takeaways
- **User-based CF** finds similar users and borrows their ratings. Netflix Prize 2006 and 80% of Netflix viewing - all rooted in this single idea
- **Item-based CF** compares items by their rating patterns. More stable than user-based (items do not change their 'taste'), scales better, can be precomputed offline. Amazon Patent 2003 - Greg Linden et al.
- **Similarity metrics**: Pearson for explicit ratings (normalizes scale), Jaccard for binary data (bought/not bought), adjusted cosine for item-based CF. Metric selection is critical - validate with A/B test
- **The k parameter** - bias-variance tradeoff: small k = noise, large k = loss of personalization. Optimum via cross-validation, typically 10-50
- **Precomputing the item-item matrix** is the key to production scale: 2,000 lookups instead of O(n*m) in real-time
Related Topics
Collaborative filtering is the foundation on which advanced methods are built:
- Introduction to Recommender Systems — Overview of content-based, collaborative, and hybrid approaches
- Matrix Factorization — SVD and ALS solve the sparsity problem of CF by extracting latent factors from the rating matrix
Вопросы для размышления
- User-based CF and item-based CF produce different recommendations. In which scenarios would one approach be better? Consider the ratio of users to items on different platforms.
- Pearson correlation = 1.0 between two users who have co-rated only 2 films. Should this similarity be trusted? What minimum co-rating threshold would be appropriate?
- If designing collaborative filtering for a platform where 99% of the data is implicit (clicks, views) and only 1% is explicit (ratings), what strategy would work best?
Связанные уроки
- rec-01 — Overview of content-based and collaborative approaches before the deep dive
- rec-03 — Matrix Factorization solves CF sparsity through latent factors
- ml-14-knn — kNN - the same nearest-neighbor mechanics applied to classification
- ml-19-pca — PCA and SVD - the mathematical foundation for matrix factorization CF
- prob-04-bayes — Bayesian update as an alternative lens for rating prediction
- aie-09-embeddings — Neural CF uses embeddings instead of explicit similarity matrices
- la-15-svd