Linear Algebra

Nonnegative Matrix Factorization

Why does SVD decompose faces into ghostly global patterns while NMF finds a nose, a mouth, and eyes? What is the mathematical reason for this difference?

**Spotify:** topic profiling of 20 million tracks using 50 NMF latent factors - the engine behind Discover Weekly playlists
**Genomics:** decomposition of mutational signature matrices - tumor samples are mixtures of mutational processes (COSMIC SBS signatures)
**NLP:** topic modeling of documents without the probabilistic assumptions of LDA - NMF applied to TF-IDF matrices is often faster
**Recommender systems:** Netflix Prize matrix factorization - user-item rating matrix decomposed into latent preference factors

Предварительные знания

Singular Value Decomposition (SVD)
Frobenius norm
KKT conditions for constrained optimization

Definition and loss function

Daniel Lee and Sebastian Seung in 1999 published NMF in Nature. Decomposition of face images: with W, H >= 0 the components of W are local parts (nose, eye, mouth), not global features. Spotify applies NMF to build topic profiles for playlists: 20 million tracks times 50 latent factors, updated daily.

NMF solves the approximation problem V approx WH subject to the nonnegativity constraint. Physical meaning: each column of V (an image or document) is a nonnegative linear combination of columns of W (basis elements). This is the parts-of-whole interpretation that SVD cannot provide because SVD allows negative weights and therefore subtractive combinations.

The key difference from SVD: with W, H >= 0 each column of V is a nonnegative linear combination of the columns of W. This guarantees a parts-of-whole interpretation without mutual cancellation of components - something SVD cannot provide.

Why does NMF give more interpretable components than SVD?

SVD allows negative weights (subtraction of components). NMF with W, H >= 0 represents each object as a sum of non-negative parts. For faces: NMF gives interpretable parts (nose, mouth, eyes); SVD gives ghostly holographic patterns with mixed signs.

Multiplicative updates algorithm

The Lee-Seung algorithm uses multiplicative updates rather than additive ones - this preserves non-negativity of W and H at every iteration. The method converges monotonically to a local minimum of the Frobenius loss.

H_{a,mu} <- H_{a,mu} * (W^T V)_{a,mu} / (W^T W H)_{a,mu} W_{i,a} <- W_{i,a} * (V H^T)_{i,a} / (W H H^T)_{i,a} Each step preserves H, W >= 0 due to the multiplicative form. The loss ||V - WH||_F^2 decreases monotonically.

What guarantees monotone decrease of the Lee-Seung multiplicative updates?

Lee and Seung (2001) proved monotonicity by constructing G(h, h') >= L(h) with G(h, h) = L(h). Minimising G ensures L(t+1) <= L(t). The problem is not jointly convex in (W, H), only convex in each variable separately.

Sparse, orthogonal NMF and applications

Basic NMF extends with regularisers and constraints. Sparse NMF with an L1 penalty is the industry standard for topic modelling: each document is described by a small number of components. Orthogonal NMF (H H^T = I) reduces to hard clustering.

Uniqueness of NMF is not guaranteed: different random starting points W0, H0 lead to different local minima. In practice, several random initializations are run and the one with the lowest reconstruction error is chosen.

How does L1-regularised (sparse) NMF help in NLP?

The L1 penalty lambda * ||H||_1 forces entries of H to zero. A document is described by only a few active topics rather than all k - this matches reality: most articles cover 1-3 topics, not all 50 at once. It is the LASSO analogue for NMF.

Connections to other factorization methods

NMF occupies a unique position among matrix decompositions - between SVD and clustering.

SVD and PCA — Related topic
K-means clustering — Related topic
LDA (Latent Dirichlet Allocation) — Related topic
Tensor decompositions — Related topic

Итоги

NMF: V approx WH with W, H >= 0 - parts-of-whole decomposition without negative weights
Lee-Seung multiplicative updates monotonically reduce ||V - WH||_F^2 while preserving nonnegativity
NMF is NP-hard in general: the algorithm finds a local minimum that depends on initialization
Sparse NMF with L1 regularization and orthogonal NMF are extensions improving interpretability
Rank k is selected via the cophenetic coefficient or stability across multiple runs
Applications span genomics, NLP, recommender systems, and hyperspectral image analysis

What guarantees the monotone decrease property of NMF multiplicative updates?

Lee and Seung proved monotonicity by constructing an auxiliary function G with G(h, h') >= L(h) and G(h, h) = L(h). Each update minimizes G, which guarantees L(t+1) <= L(t). The problem is not jointly convex in (W, H), only separately convex in each variable.