Recommender Systems

Collaborative Filtering

Netflix Prize 2006: one million dollars for a 10% improvement in recommendation quality. BellKor Pragmatic Chaos won. Netflix never deployed the algorithm - too expensive to integrate. Today: 80% of Netflix viewing comes from recommendations. Amazon generates 35% of revenue through the phrase 'Customers who bought this also bought...'. Behind that phrase is collaborative filtering - an algorithm that has no idea what a 'book' or a 'film' actually is. It knows only one thing: who rated what.

  • **Amazon** (item-based CF, Linden et al. 2003) processes billions of purchases to suggest similar items - the precomputed item-item matrix is refreshed daily, generating 35% of revenue
  • **Netflix** uses both user-based and item-based CF as components of a hybrid system combined with deep learning - 80% of viewing comes from recommendations
  • **Spotify** Discover Weekly - user-based CF on implicit feedback (listening time), 30 tracks per week, MAU >400 million
  • **Last.fm** built a scrobble-based CF: analyzes listening history of millions of users to generate music recommendations without any explicit ratings
  • **YouTube** uses item-based CF as one component of its hybrid ranking system - over 500 hours of video uploaded every minute

Предварительные знания

  • Introduction to Recommender Systems

Amazon and the Birth of Industrial Collaborative Filtering

In 2003, Greg Linden, Brent Smith, and Jeremy York of Amazon published "Amazon.com Recommendations: Item-to-Item Collaborative Filtering". They showed that a precomputed item-item similarity table enables O(1) recommendations - just a table lookup. This solved the scale problem: Amazon was serving millions of users, and user-based CF was too slow. The approach became the industry standard for a decade. US Patent 6266649B1 - 'item-to-item collaborative filtering' - still belongs to Amazon.

User-Based CF: Finding Similar Users

Netflix Prize 2006: one million dollars for a 10% improvement in recommendation quality. BellKor Pragmatic Chaos won. Netflix never used the algorithm - too expensive to deploy. Today: 80% of Netflix viewing comes from recommendations. The price of that 80% is not complex mathematics. It is one question: **who watched the same thing?**

The algorithm works in three steps: 1. **Compute similarity** between the target user and all others 2. **Select k nearest neighbors** (users with the highest similarity) 3. **Predict the rating** as a weighted average of the neighbors' ratings Prediction formula for user **u** on item **i**:

Why use **(r_vi - r_v_mean)** rather than simply **r_vi**? Because people use rating scales differently. A generous user gives 4-5 stars to everything. A strict one gives 2-3. If Bob rates a film 4 (with a personal mean of 4.5), that is **below his norm** - he did not particularly like it. Subtracting the mean **normalizes** the ratings: +0.5 means "better than usual", -1.0 means "worse than usual" for that user.

**Complexity of user-based CF:** each prediction requires computing similarity against **every** user - O(n) per prediction, where n is the number of users. For Netflix at 500M users this is prohibitive. In practice: (1) precompute the similarity matrix offline, (2) use approximate nearest neighbors (ANN), or (3) switch to item-based CF.

**The minimum co-rating threshold** is a frequent beginner mistake. If two users have rated only 1 common film with the same score, their Pearson correlation is 1.0 (perfect). But this is **statistically meaningless**. Always enforce a minimum number of co-rated items (typically 5-20) before trusting the similarity value.

User A rates all films 4-5 stars (mean 4.5). User B rates them 1-3 (mean 2.0). Both gave film X a rating of 5. What does normalization reveal?

Item-Based CF: Finding Similar Items

In 2003, Amazon engineers published a paper that changed the industry: **"Item-to-Item Collaborative Filtering"** (Linden et al.). Rather than searching for similar users, they proposed searching for **similar items**. If people who bought book A also bought book B, those books are similar. The famous phrase "Customers who bought this also bought..." is item-based CF. That algorithm added 35% to Amazon's annual revenue.

The key advantage is **stability**. User tastes drift - one may watch action films today and documentaries next year. But the relationship between items is stable: The Lord of the Rings and The Hobbit will always be similar. The item-item similarity matrix is computed once and updated infrequently. There are also typically **far fewer items than users**, so the matrix is more compact.

**Adjusted cosine vs plain cosine for item-based CF.** Plain cosine compares matrix columns directly. The problem: a user who gives everything 5 stars inflates similarity for every item pair. **Adjusted cosine** subtracts each user's mean before computing - this is the item-based analog of Pearson correlation. Research consistently shows adjusted cosine to be significantly more accurate.

**Precomputation is the key to scaling.** In production, the item-item similarity matrix is computed offline (a batch job running daily). At request time the system looks up: which items has this user rated -> for each, find k similar items -> rank. The online portion is O(m*k), where m is the number of user ratings and k is the neighborhood size. With 100 ratings and k=20, that is 2,000 lookups - milliseconds.

Why did Amazon choose item-based CF over user-based CF in 2003?

Similarity Metrics: Cosine, Pearson, Jaccard

The whole mechanism of collaborative filtering rests on one question: **how do we measure similarity?** Two users are two rating vectors. How "close" are they? The answer depends on the chosen metric, and this choice **critically affects** recommendation quality. Three core metrics - **cosine similarity**, **Pearson correlation**, **Jaccard index** - each occupy a distinct niche.

MetricRangeAccounts for scaleBest application
Cosine similarity[-1, 1]No - compares vector directionExplicit ratings (1-5 stars), TF-IDF text vectors
Pearson correlation[-1, 1]Yes - subtracts the meanExplicit ratings with heterogeneous user rating scales
Jaccard index[0, 1]Not applicableBinary data: bought/not bought, clicked/not clicked, liked/disliked
Adjusted cosine[-1, 1]Yes - subtracts each user's meanItem-based CF (the standard for item similarity)

**Which metric to use when?** A practical rule: - **Explicit ratings** (1-5 stars): **Pearson** for user-based, **adjusted cosine** for item-based. Normalization removes the effect of generous and strict raters. - **Binary data** (bought / not bought): **Jaccard**. There is no "scale" - only sets. - **Implicit feedback** (watch time, clicks): **Cosine** on normalized data. Pearson performs poorly on non-negative data.

**Pearson is undefined when variance is zero.** If a user gave every film the same rating (e.g., all 5s), the mean-centered vector is [0, 0, 0, ...]. Pearson correlation produces division by zero. Handle this edge case explicitly - return 0 (no information about preferences).

**Production advice:** do not commit to one metric - A/B test all of them. Research shows that the "best" metric depends on the domain and data distribution. For MovieLens, Pearson typically wins; for Amazon (implicit), cosine usually does better. The only reliable method is experimentation on the actual production data.

An online store knows only whether a purchase was made (bought / not bought), without any ratings. Which similarity metric is most appropriate?

Choosing the Neighborhood Size: the k Parameter

Similarity to every user has been computed. Now the question is: **how many neighbors (k) to use for prediction?** This is a hyperparameter, and choosing it correctly is a balance between noise and signal. Too few neighbors make predictions unstable. Too many pull in distant users with irrelevant tastes.

**Weighted average** is the key mechanism. Not all neighbors are equal: the nearest neighbor (similarity = 0.95) should have more influence than a distant one (similarity = 0.3). Ratings are therefore weighted by similarity. This automatically reduces the impact of distant neighbors - even when k is large.

ParameterSmall k (1-5)Medium k (10-30)Large k (50+)
BiasLowMediumHigh
VarianceHighMediumLow
SpeedFastModerateSlow
Sparse dataMay find no neighborsBalancedIncludes noisy neighbors
Dense dataUnstableOptimalExcessive

**Threshold-based neighborhood** is an alternative to a fixed k. Instead of "take the k nearest", take "everyone with similarity > threshold" (e.g., > 0.5). This is adaptive: a popular user may have 100 neighbors, a niche user just 3. A combined approach: threshold > 0.3 AND k <= 50.

**A common mistake: evaluating k on the same data used for training.** This will underestimate error and inflate the apparent optimal k. Always use a held-out set or cross-validation. On MovieLens-1M, the optimal k is typically 20-50 for user-based and 10-30 for item-based CF.

The more neighbors (k), the better the prediction - more data means more information

There is an optimal k that depends on the data. A k that is too large adds noise from irrelevant 'neighbors', diluting personalization. The optimum is typically k=20-50 for user-based and k=10-30 for item-based CF.

This is a manifestation of the bias-variance tradeoff. With a small k - high variance (noisy predictions from individual neighbors). With a large k - high bias (predictions converge to the mean because distant 'neighbors' contribute irrelevant ratings). The optimal k minimizes total error and is found via cross-validation on the specific dataset.

With k=100 in user-based CF, predictions become less personalized and converge toward the global mean rating. Why?

Key Takeaways

  • **User-based CF** finds similar users and borrows their ratings. Netflix Prize 2006 and 80% of Netflix viewing - all rooted in this single idea
  • **Item-based CF** compares items by their rating patterns. More stable than user-based (items do not change their 'taste'), scales better, can be precomputed offline. Amazon Patent 2003 - Greg Linden et al.
  • **Similarity metrics**: Pearson for explicit ratings (normalizes scale), Jaccard for binary data (bought/not bought), adjusted cosine for item-based CF. Metric selection is critical - validate with A/B test
  • **The k parameter** - bias-variance tradeoff: small k = noise, large k = loss of personalization. Optimum via cross-validation, typically 10-50
  • **Precomputing the item-item matrix** is the key to production scale: 2,000 lookups instead of O(n*m) in real-time

Related Topics

Collaborative filtering is the foundation on which advanced methods are built:

  • Introduction to Recommender Systems — Overview of content-based, collaborative, and hybrid approaches
  • Matrix Factorization — SVD and ALS solve the sparsity problem of CF by extracting latent factors from the rating matrix

Вопросы для размышления

  • User-based CF and item-based CF produce different recommendations. In which scenarios would one approach be better? Consider the ratio of users to items on different platforms.
  • Pearson correlation = 1.0 between two users who have co-rated only 2 films. Should this similarity be trusted? What minimum co-rating threshold would be appropriate?
  • If designing collaborative filtering for a platform where 99% of the data is implicit (clicks, views) and only 1% is explicit (ratings), what strategy would work best?

Связанные уроки

  • rec-01 — Overview of content-based and collaborative approaches before the deep dive
  • rec-03 — Matrix Factorization solves CF sparsity through latent factors
  • ml-14-knn — kNN - the same nearest-neighbor mechanics applied to classification
  • ml-19-pca — PCA and SVD - the mathematical foundation for matrix factorization CF
  • prob-04-bayes — Bayesian update as an alternative lens for rating prediction
  • aie-09-embeddings — Neural CF uses embeddings instead of explicit similarity matrices
  • la-15-svd
Collaborative Filtering

0

1

Sign In