Linear Algebra
Matrix Rank: How Much Information Survives
The Netflix rating matrix - 480K users by 17K movies - is almost entirely zeros. Yet its rank is low: people's tastes are driven by dozens of latent factors, not millions of independent preferences. Matrix rank is the number of dimensions of information that actually survive.
- Netflix Prize: the rating matrix has low rank - ~20 latent factors explain most of the variance
- LoRA: weight updates ΔW during fine-tuning are empirically low-rank
- Computer vision: the matrix of face images has rank ~40 (eigenfaces)
- Model diagnostics: rank of the Jacobian determines how many parameters are identifiable
- Compression: storing a rank-r matrix as AB uses (m+n)r instead of mn entries
Rank and Gaussian elimination
**PCA on ImageNet compresses a 224x224x3 = 150,528-feature matrix down to a few dozen components, losing less than 5% of variance.** Behind this feat is a single number: rank. A matrix with 150,528 columns does not contain 150,528 independent directions. Rank is that smaller count: the **dimension of the column space**, the amount of truly independent information a matrix carries.
Real-world data is almost always low-rank by nature. The 1.2M images in ImageNet do not live in a 150,528-dimensional space; they inhabit a manifold of far smaller intrinsic dimension. The Netflix rating matrix (480K users by 17K movies) has an effective rank of around 40. Weight matrices in trained transformers follow the same pattern. Rank measures the gap between a matrix's shape and its true information capacity.
Definition: Geometric Meaning
A matrix A of size m-by-n defines a linear map from R^n to R^m. The image Im A, the set of all vectors Ax, is a subspace of R^m. The dimension of that subspace is the rank.
A (3x3) with rank(A) = r maps R^3 into a subspace of dimension r: r = 3: Im A = all of R^3. The map is invertible; no information is lost. r = 2: Im A = a plane inside R^3. One axis collapses to a point. r = 1: Im A = a line. Two dimensions are destroyed. r = 0: A = 0. All information is gone. Lost dimensions are unrecoverable: the inverse matrix does not exist when r < n. rank(A) = dim(Im A) = dim(col A)
A key identity: **row rank equals column rank**. This is not obvious. The same number describes both the row space and the column space. It follows from Gaussian elimination: elementary row operations preserve row rank and, more subtly, also preserve column rank.
Computing Rank: Gaussian Elimination
The algorithm: reduce the matrix to row echelon form (REF) via elementary row operations, then count the nonzero rows. Elementary row operations do not change rank; they change the basis of the row space but not its dimension.
**Numerical vs algebraic rank**: the matrix [[1, 0], [0, 1e-16]] has algebraic rank 2, but in float64 the second singular value falls below tol and numpy reports rank 1. For real data with noise, this is the correct behavior. The default tol in `np.linalg.matrix_rank` is max(M,N) * max(sv) * eps, a sensible choice for practical computation.
What does rank equal after reducing a matrix to row echelon form?
After Gaussian reduction to row echelon form, each non-zero row has a leading entry (pivot). The number of pivots equals the rank. Equivalently: the number of basis columns equals column rank equals row rank.
Rank-Nullity theorem
Rank-Nullity Theorem
The input space of A (m-by-n) has dimension n. Part of it maps into Im A (dimension rank), and the rest collapses to the zero vector: that is ker A. The rank-nullity theorem captures this balance.
**Multicollinearity = low rank of the feature matrix.** When a dataset contains both 'area in m^2' and 'area in ft^2', those columns are linearly dependent and rank(X) < n. The normal equation X^T X x = X^T y loses invertibility: det(X^T X) = 0. Ridge regression fixes this by adding lambda*I, which guarantees rank(X^T X + lambda*I) = n for any lambda > 0.
Rank and Solvability of Ax = b
| Condition | Geometric meaning | Result |
|---|---|---|
| rank(A) = rank(A|b) = n | b in Im A, ker A = {0} | Unique solution |
| rank(A) = rank(A|b) < n | b in Im A, ker A non-trivial | Infinitely many solutions |
| rank(A) < rank(A|b) | b not in Im A | No solution (inconsistent) |
(A|b) is the augmented matrix. If appending column b raises the rank, then b lies outside the image of A: the right-hand side vector is beyond what A can produce.
What does the rank-nullity theorem say for an m×n matrix A?
The domain has dimension n. It splits into rank (size of image) plus nullity (size of kernel). Every input direction either maps to the image (rank contribution) or collapses to zero (nullity contribution).
Rank, determinant, and ML applications
LoRA: Low-Rank Adaptation of Neural Networks
LoRA (Low-Rank Adaptation, Microsoft 2021) fine-tunes large models with 100-1000x less memory. The core observation: weight updates during fine-tuning live on a low-rank subspace. Instead of storing the full delta_W in R^{m times n}, the update is factored into a product of two thin matrices.
Full update: delta_W in R^{4096 x 4096} = 16,777,216 parameters LoRA with r = 8: A in R^{8 x 4096} = 32,768 parameters B in R^{4096 x 8} = 32,768 parameters Total: 65,536 (256x fewer) With r = 16 (QLoRA standard): 2 * 4096 * 16 = 131,072 (128x fewer) rank(BA) <= r by the rank-of-product theorem.
**Why this works**: LoRA assumes that the fine-tuning task lives on a low-rank subspace of the weight space. Empirically, models fine-tuned with r=8 are nearly indistinguishable in quality from full fine-tuning at r=4096. This points to a fundamental redundancy in neural networks.
SVD and the Eckart-Young Theorem
Any real matrix decomposes as A = U Sigma V^T, where U and V are orthogonal and Sigma is diagonal with nonneg entries (singular values). The Eckart-Young theorem states that truncating to the top k singular values gives the **optimal** rank-k approximation in both the Frobenius and spectral norms.
Key Properties of Rank
- **rank(A) = rank(A^T)** -- row rank equals column rank
- **rank(A) <= min(m, n)** -- bounded by the smaller dimension
- **rank(AB) <= min(rank(A), rank(B))** -- multiplication cannot increase rank
- **rank(A + B) <= rank(A) + rank(B)** -- addition is bounded by the sum of ranks
- Elementary row and column operations do not change rank
- rank(A) = number of nonzero singular values = number of pivot rows in REF
Where Rank Shows Up in ML
**Rank in industry applications** Key places where matrix rank is the decisive concept - **LoRA / QLoRA** (Fine-tuning via delta_W = BA, rank(BA) <= r): Llama, Mistral, Stable Diffusion. r = 4..64. 100-1000x memory reduction. - **PCA / SVD** (Rank-k approximation is optimal compression): ImageNet: 150,528 -> 50 components. Explained variance proportional to sigma_i^2. - **Recommender systems** (Matrix factorization: R ~ UV^T, rank = r): Netflix, Spotify. Users and items share a low-rank latent space. - **Data diagnostics** (rank(X) < n signals multicollinearity): Detect duplicate or linearly dependent features. Ridge/Lasso as fix. - **Model compression** (SVD weights, keep top-k components, run on edge devices): Faster inference on iOS/Android without retraining.
Practice: Low-Rank Recommender
**Interview-style review questions:** - **Q:** The matrix X^T X appears in the normal equation of linear regression. What happens when rank(X) < n? - Key points: X^T X becomes singular (det = 0), no inverse exists; The normal equation has infinitely many solutions: coefficients are not uniquely determined; Root cause: multicollinearity, some features are linearly dependent; Ridge regression adds lambda*I: rank(X^T X + lambda*I) = n guaranteed for any lambda > 0 - **Q:** LoRA sets rank = 8 instead of 4096 for a 4096-by-4096 weight matrix. Why does this work? - Key points: delta_W = BA has rank <= 8 by the rank-of-product theorem; Fine-tuning empirically lives on a low-rank subspace of weight space; Parameters: 2 * 4096 * 8 = 65K instead of 16M, a 256x saving; Neural networks are over-parameterized: tasks require far fewer effective degrees of freedom - **Q:** Why is the Eckart-Young theorem central to PCA and image compression? - Key points: Among all rank-k matrices, A_k = U_k Sigma_k V_k^T is closest to A in Frobenius norm; Error = sqrt(sigma_{k+1}^2 + ... + sigma_r^2), determined by discarded singular values; When singular values decay fast (as in real data), small k gives small error; PCA is SVD of the data matrix: first k components capture maximum variance
Why does LoRA work: a low-rank product B*A can express useful weight updates?
Microsoft showed in 2021 that fine-tuning a large LLM effectively lives in a low-dimensional subspace (~8-64 for typical tasks). So B*A with rank r=8 captures 99% of the useful information in DeltaW while costing <1% of the parameters.
Key Takeaways
- **Rank = dimension of the image**: the number of independent directions a matrix preserves
- **rank + nullity = n**: the balance between image and kernel, rank-nullity theorem
- **Gaussian elimination**: reduce to REF, count nonzero rows
- **SVD**: rank(A) = number of nonzero sigma_i. A_k = optimal rank-k approximation (Eckart-Young)
- **LoRA**: rank(delta_W) <= r << min(m,n). Fine-tuning tasks live on low-rank subspaces
- **rank(AB) <= min(rank(A), rank(B))**: a bottleneck layer caps the rank of the entire chain
What Comes Next
Rank connects nearly everything in linear algebra
- SVD (Singular Value Decomposition) — SVD makes rank explicit via singular values; foundation of low-rank approximation
- LU Decomposition — Gaussian elimination for rank computation is LU factorization at its core
- Matrix Inverse — Exists if and only if rank = n (full rank)
- PCA and Eigenvalues — rank(A) = number of nonzero eigenvalues for symmetric matrices. PCA finds principal components via eigenvectors