Algebra
Singular Value Decomposition
LoRA lets one fine-tune GPT-4-scale models for $100 instead of $1M-using the SVD insight that weight updates are low-rank. The Netflix Prize was won with matrix factorization (SVD). SVD is the universal X-ray of any dataset.
- **LoRA in LLMs:** fine-tuning via W + AB with rank 4-64 for matrices of size 4096×4096; used in llama.cpp, HuggingFace PEFT, QLoRA
- **PCA via SVD:** sklearn.PCA uses truncated SVD (ARPACK/randomized SVD); numerically more stable than eigendecomposing the covariance matrix
- **Recommender systems:** user×item rating matrix is factorized via SVD into latent factor spaces; Netflix, Spotify, Amazon all use variants
Предварительные знания
SVD: Definition and Geometry
For any matrix A ∈ ℝ^{m×n}, there exists a decomposition A = UΣVᵀ, where U ∈ ℝ^{m×m} and V ∈ ℝ^{n×n} are orthogonal, and Σ ∈ ℝ^{m×n} is diagonal with σ₁ ≥ σ₂ ≥ … ≥ σₙ ≥ 0 (singular values). Geometrically: Vᵀ rotates, Σ scales along axes, U rotates again.
**Connection to eigenvalues:** σᵢ = √λᵢ(AᵀA), where λᵢ are eigenvalues of the PSD matrix AᵀA. Columns of V are eigenvectors of AᵀA; columns of U are eigenvectors of AAᵀ.
Use np.linalg.svd(A, full_matrices=False) for the thin SVD-it only returns the rank(A) singular triplets and is significantly faster for tall/wide matrices.
For a 5×3 full-rank matrix A, the thin SVD A = UΣVᵀ has shapes:
Eckart-Young Theorem: Best Low-Rank Approximation
The Eckart-Young theorem (1936): the best rank-k approximation of A in the Frobenius (and spectral) norm is Aₖ = U_k Σ_k V_kᵀ, the sum of the first k SVD terms. Error: ||A − Aₖ||_F = √(σ²_{k+1} + … + σ²_r).
**LoRA (Low-Rank Adaptation)** in LLMs: instead of fine-tuning the full weight matrix W ∈ ℝ^{d×d}, train only W + AB where A ∈ ℝ^{d×r}, B ∈ ℝ^{r×d} and r ≪ d. This is exactly a low-rank addition to the weights. Memory savings: 2dr vs d² parameters.
Matrix A has singular values σ₁=10, σ₂=5, σ₃=1. The Frobenius error of the best rank-2 approximation is:
Moore-Penrose Pseudoinverse
The Moore-Penrose pseudoinverse: A⁺ = VΣ⁺Uᵀ, where Σ⁺ is formed by replacing nonzero σᵢ with 1/σᵢ. A⁺b gives the minimum-norm least-squares solution. For full column rank: A⁺ = (AᵀA)⁻¹Aᵀ (left inverse for overdetermined systems).
Explicit pseudoinverse computation is numerically unstable. Use np.linalg.lstsq for least squares and np.linalg.pinv only when one need the pseudoinverse matrix itself. Both use SVD internally.
In A⁺ = VΣ⁺Uᵀ, what is Σ⁺?
PCA, LoRA, and Recommender Systems
PCA via SVD: for a centered data matrix X (rows = samples), compute X = UΣVᵀ. Principal components are the columns of V; projections are XVₖ = UₖΣₖ; explained variance ratio is σᵢ²/Σσᵢ². LoRA: W_adapted = W_pretrained + AB, rank(AB) = r ≪ min(m,n).
| Application | Matrix | What we take | Why |
|---|---|---|---|
| PCA | X (data) | V: right singular vectors | Dimensionality reduction |
| LoRA | ΔW = AB | Low-rank addition | Efficient LLM fine-tuning |
| Recommender SVD | R (ratings) | U (users) and V (items) | Matrix factorization |
| LSA (NLP) | TF-IDF matrix | Semantic concepts | Latent Semantic Analysis |
LoRA uses W + AB instead of full W. The number of trainable parameters drops from d² to:
Key Ideas
- **A = UΣVᵀ:** U,V orthogonal; σᵢ = √λᵢ(AᵀA); geometrically: rotate, scale along axes, rotate back
- **Eckart-Young:** best rank-k approximation = first k SVD terms; error = √(Σᵢ>ₖ σᵢ²)
- **A⁺ = VΣ⁺Uᵀ:** Moore-Penrose pseudoinverse; A⁺b = minimum-norm least-squares solution
- **LoRA = low-rank addition:** W + AB, rank = r ≪ d; saves 2dr vs d² parameters; PCA uses the same idea
Related Topics
SVD generalizes eigendecomposition to rectangular matrices:
- Eigenvalues and Eigenvectors — σᵢ = √λᵢ(AᵀA); SVD simultaneously eigendecomposes AᵀA and AAᵀ
- Quadratic Forms — AᵀA is PSD; its eigenvalues = σᵢ²-the connection between SVD and quadratic forms
- Linear Algebra in ML and Graphics — LoRA, Flash Attention, matrix factorization all use SVD ideas
Вопросы для размышления
- LoRA uses rank 4-64 for matrices of size 4096×4096. Why does such a small rank work? What does this tell us about the structure of weight updates during fine-tuning?
- PCA via SVD and via eigendecomposition of the covariance matrix give the same result. What is the numerical difference between the two approaches?
- Randomized SVD (Halko-Martinsson-Tropp, 2011) computes an approximate SVD in O(mnk) instead of O(mn·min(m,n)). How do random projections help speed up the computation?