Recommender Systems

Matrix Factorization

September 2009. Netflix Prize. Team BellKor's Pragmatic Chaos wins 1 million dollars. Their solution: an ensemble of hundreds of models with SVD at the core. The Netflix rating matrix - 500 million users x 15,000 films - is less than 1% filled. 7.5 trillion empty cells. Matrix factorization fills them by "guessing" latent preferences that exist nowhere in the data. A few hundred numbers per user instead of terabytes of emptiness.

**Netflix Prize** (2009) - the winning ensemble used SVD and its variants as a core component, blending latent factor models with neighborhood-based approaches
**Spotify Discover Weekly** uses matrix factorization on a user x track matrix of billions of plays (implicit feedback) to generate personalized playlists
**Apache Spark MLlib** implements ALS as its primary production-ready recommendation algorithm, processing petabytes of data across distributed clusters

Предварительные знания

Collaborative Filtering

Netflix Prize and the Birth of Modern Recommender Systems

October 2006. Netflix launches a competition with a 1 million dollar prize: improve Cinematch (their recommendation algorithm) by 10% on RMSE. Three years of competition - over 2,000 teams from 186 countries. Winner BellKor's Pragmatic Chaos hit 10.06% improvement just hours before the final deadline. Their approach combined SVD++, temporal dynamics, and neighbourhood models. The irony: Netflix never deployed the winning algorithm in production - by then the business had shifted to streaming, where implicit feedback (watch time) outweighs explicit star ratings.

SVD: Hidden Factors of Preference

Why do certain people love Inception, Interstellar, and The Matrix? These films share a **latent factor** - call it "mind-bending sci-fi". The factor lives nowhere in the database, yet it leaks out through rating patterns. **Matrix Factorization** automatically extracts such hidden factors by decomposing the rating matrix into the product of two smaller matrices.

**SVD (Singular Value Decomposition)** decomposes the rating matrix **R** (m users x n items) into three matrices: **R ≈ U · Σ · V^T** - **U** (m x k) - user matrix: each row is a vector of a user's latent preferences - **Σ** (k x k) - diagonal matrix of the significance of each factor - **V^T** (k x n) - item matrix: each column is a vector of latent item characteristics **k** is the number of latent factors (typically 20-200). This is a "compressed" representation: instead of storing all ratings, we store k numbers per user and k per item.

**Latent factors have no labels.** SVD finds mathematical axes that maximize explained variance. The "sci-fi" or "romance" interpretation comes after the fact, by examining correlations. Sometimes a factor resists interpretation - it blends several aspects. Totally fine: the model does not need to "understand" factors to predict accurately.

**Classic SVD chokes on missing values.** `numpy.linalg.svd` demands a complete matrix - no zeros, no NaN. In a real rating matrix 97-99% of cells are empty. Filling missing entries with zeros distorts the decomposition (zero != "not rated"). Workarounds: 1. impute with user/item means 2. use ALS or SGD-based optimization that operates only on known entries.

**Choosing k (number of factors)** is a trade-off between accuracy and generalization. k=2 is too coarse; k=1000 invites overfitting. On MovieLens-10M the sweet spot is usually k=50-200. Practical recipe: plot the decay of singular values (a scree plot) and pick the "elbow" - the point after which values fall slowly.

SVD decomposes the rating matrix R into U·Σ·V^T. What do the rows of matrix U represent?

ALS: Iterative Optimization for Sparse Matrices

Classic SVD demands a complete matrix. But on Netflix 99% of cells are empty. So what now? **ALS (Alternating Least Squares)** nails the problem compactly: instead of decomposing a full matrix, it **optimizes** purely on **known ratings**. Like reassembling a jigsaw with only 1% of the pieces in hand - yet the algorithm reconstructs the full picture.

The ALS idea: find matrices **P** (users x k) and **Q** (items x k) such that **R ≈ P · Q^T**. Minimize error only on known ratings: **min Σ(u,i)∈known (r_ui - p_u · q_i^T)² + λ(||P||² + ||Q||²)** The term **λ(||P||² + ||Q||²)** is **regularization** - the brake on overfitting. Drop it and the model memorizes training ratings while predicting new ones poorly.

**ALS parallelizes naturally.** Key property: with Q fixed, the update for each p_u is **independent** of every other user. Step 1 is embarrassingly parallel across users; step 2 is embarrassingly parallel across items. Exactly why **Apache Spark MLlib** picked ALS as its primary recommendation algorithm - it maps perfectly onto MapReduce.

Parameter	Typical values	Effect
k (factors)	20-200	Larger k → more accurate, but slower and at risk of overfitting
λ (regularization)	0.01-1.0	Larger λ → more robust, but less accurate (underfitting)
Iterations	10-30	ALS converges quickly - 10-15 iterations are usually sufficient
Initialization	N(0, 0.1)	Poor initialization → slow convergence, risk of a bad local minimum

**SGD vs ALS:** the alternative to ALS is **Stochastic Gradient Descent (SGD)**. SGD processes one rating at a time (fast updates); ALS processes an entire row or column (more stable steps). SGD wins on dense data; ALS parallelizes better and handles implicit feedback more naturally. The Netflix Prize winner ran SGD; Spark MLlib runs ALS.

Why does ALS alternate between optimizing P and Q rather than optimizing them jointly?

NMF: Interpretable Non-Negative Factors

SVD and ALS produce factors with **negative** values. What does -0.5 on factor 3 actually mean for a user? Hard to interpret. **NMF (Non-negative Matrix Factorization)** bolts on one constraint: **all values in P and Q must be non-negative**. That single constraint transforms interpretation - factors become **additive parts**.

The intuition: a film's rating is the sum of "how much sci-fi it contains" + "how much drama" + "how much comedy". Each component is >= 0 - a film cannot have "negative comedy". NMF finds exactly that decomposition: **R ≈ W · H**, where every element of W and H is >= 0. - **W** (users x k) - how strongly each user "is into" each factor - **H** (k x items) - how much each item "contains" each factor

Property	SVD	NMF
Factor values	Arbitrary (including negative)	Non-negative only (>= 0)
Interpretability	Low (axes without meaning)	High (additive parts)
Accuracy	Maximum (for rank-k)	Slightly below SVD
Uniqueness	Unique decomposition	Not unique (depends on initialization)
Sparsity	Dense factors	Can produce sparse factors
Application	General purpose, dimensionality reduction	When interpretation matters: topic modeling, recommendations

**NMF and topic modeling.** NMF goes far beyond recommendations. Feed it a TF-IDF matrix (documents x words) and NMF surfaces "topics": each factor is a set of weighted words. Topic 1: {machine: 0.8, learning: 0.7, neural: 0.6}. Topic 2: {market: 0.9, stock: 0.7, price: 0.6}. Non-negativity is the key - a word cannot "negatively" belong to a topic.

**NMF refuses negative data.** Mean-center the ratings (subtract the mean) and values go negative - NMF errors out or yields nonsense. For NMF, feed raw ratings (1-5 stars) or normalize to the [0, 1] range. The constraint is the price of interpretability.

**When to pick NMF over SVD?** Whenever the recommendation must be **explainable**: "We recommend this film because this user enjoys sci-fi (factor 1) and films with unexpected endings (factor 3)". SVD cannot deliver that - its factors mix positive and negative components.

What is the main advantage of NMF over SVD in recommender systems?

Implicit Feedback: Clicks Instead of Stars

Netflix asks users to rate films 1-5 stars. But 99% of user interactions happen **without any rating**: clicks, views, time on a page, purchases, cart additions, track skips. This is **implicit feedback** - the system watches what users do rather than asking their opinion. And there is orders of magnitude more of it.

Key differences from explicit ratings: - **No negative signals.** A user did not click an item - that does not mean dislike. Maybe they just never noticed it. - **Variable signal strength.** Click - weak signal. Purchase - strong. Watching 95% of a film - very strong. Stopping after 2 minutes - negative. - **Noise.** Ad clicks, accidental mobile taps, gifts for someone else - all noise in implicit feedback.

2008. Hu, Koren, and Volinsky (Yahoo! and AT&T Labs) publish the landmark paper **"Collaborative Filtering for Implicit Feedback Datasets"**. Their key idea: introduce **confidence**. More interactions, higher confidence the user likes the item. Formula: **c_ui = 1 + α × r_ui**, where r_ui is the interaction count.

Property	Explicit Feedback	Implicit Feedback
Data	1-5 star ratings	Clicks, views, purchases, time spent
Volume	Small (1-3% of the matrix)	Massive (billions of events)
Negative signals	Rating 1-2 = dislike	No click != dislike
Noise	Low (deliberate rating)	High (accidental clicks, gifts)
Bias	Selection bias (users rate what they like)	Exposure bias (users click what they are shown)
Scalability	Poor (few users leave explicit ratings)	Excellent (every action is a signal)

**Confidence weighting is the heart of the implicit approach.** One click → c = 1 + 40×1 = 41. Thirty plays → c = 1 + 40×30 = 1201. Confidence in the preference scales with interaction count. Even zero cells get c = 1 - treated as weakly negative preferences. This kills the "no click != dislike" problem.

**Exposure bias is the main trap in implicit feedback.** Users only click items they **were shown**. If the system never displayed an item, there are no clicks, and the model concludes the item is uninteresting. A feedback loop: unseen items get no clicks → they rank lower → they get shown even less. The fix: exploration (random recommendations) and causal inference methods.

**In practice: combine implicit signal types.** Spotify runs a hierarchy of actions: skip after 5 seconds = strong negative; listened 70%+ = positive; added to playlist = strong positive; saved to library = very strong positive. Each action carries its own weight in the confidence calculation. Far richer than click counts alone.

Key Takeaways

**SVD** decomposes R ≈ U·Σ·V^T, extracting latent factors - 50-200 numbers that encode a user's taste and an item's essence. 7.5 trillion empty cells compressed to a few megabytes.
**ALS** solves the sparsity problem: it optimizes only on known ratings, alternating between updating P and Q. Each step is a linear regression, easily parallelized (Spark MLlib).
**NMF** trades accuracy for interpretability: non-negative factors are additive "parts" (genres, topics). Recommendations can be explained to the user.
**Implicit feedback** (Hu et al. 2008) - confidence-weighted ALS for clicks, views, and purchases. It scales better than explicit ratings, and it is what YouTube, Spotify, and TikTok are built on.

Вопросы для размышления

SVD extracts k latent factors. How is the optimal k determined? Which metrics are most useful: RMSE on validation, a scree plot, or business metrics (CTR, conversion rate)?
Implicit feedback contains no explicit negative signals - the user never said "I dislike this". How to distinguish "has not seen the item" from "saw it but was not interested"? What additional data would help?
NMF gives interpretable factors but is less accurate than SVD. In which products is recommendation explainability critical (e.g., medicine, finance), and in which is it less important (entertainment)?

Связанные уроки

rec-02 — Collaborative filtering is the foundation before matrix factorization
ml-19-pca — SVD is the mathematical basis of PCA - the same latent factors
rec-04 — Deep learning recommenders build on top of MF ideas
aie-09-embeddings — Embeddings in LLMs and user/item vectors share the same math
opt-01 — Optimization methods: ALS and SGD are special cases of general schemes
ml-01 — ML basics: overfitting and regularization needed to understand ALS
la-15-svd