Algebra
Linear Algebra in ML and Graphics
A transformer run involves 12+ matrix multiplications per token. Flash Attention makes GPT-4-scale models 3× faster through one algorithmic insight: tile the matrices to stay in fast SRAM. A WebGPU shader for a 3D scene multiplies 4×4 matrices for every vertex, 60 times per second. Linear algebra isn't abstract-it's running right now, at massive scale.
- **Flash Attention:** O(T) memory instead of O(T²) via tiled softmax; 2-4× speedup on long contexts; shipped in torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+)
- **LoRA + SVD:** fine-tune LLaMA-7B on one GPU in hours; rank=16 for 4096×4096 matrices = 131K trainable params instead of 16M; stored as ΔW = AB added on top of frozen weights
- **3D Graphics:** WebGPU/Vulkan/OpenGL-all transforms are 4×4 matrix multiplies; character animation via quaternions + SLERP; NeRF and 3D Gaussian Splatting use the same MVP matrices for differentiable rendering
Предварительные знания
Matrix Factorizations in ML
GPT-4 has 1.8 trillion parameters organized as matrices. QR decomposition initializes weights orthogonally. SVD powers LoRA - fine-tuning a 4096×4096 matrix with rank-16 reduces 16M trainable parameters to 131K. Cholesky factorization makes VAE reparameterization exact. These three factorizations are what make training at scale feasible.
| Factorization | Requirement | Cost | ML Applications |
|---|---|---|---|
| QR: A = QR | Any A | O(mn²) | Stable least squares, orthogonal init, QR iteration |
| SVD: A = UΣVᵀ | Any A | O(mn·min(m,n)) | PCA, LoRA, matrix factorization, pseudoinverse |
| Cholesky: Σ = LLᵀ | PSD matrix | O(n³/3) | Gaussian sampling, Kalman filter, GP regression |
**Reparameterization trick** in VAEs: to backpropagate through sampling, write z = μ + Lε where ε ~ N(0,I) and L is the Cholesky factor of the covariance matrix. Gradients flow through μ and L automatically; the randomness is isolated in ε.
Which factorization is standard for sampling from a multivariate Gaussian N(μ, Σ) with full covariance?
Attention Mechanism as Linear Algebra
Self-attention is a weighted combination of value vectors: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. Q, K, V ∈ ℝ^{T×d} are linear projections of input X through learned weight matrices W_Q, W_K, W_V. The entire mechanism is matrix multiplications plus a softmax nonlinearity.
**Flash Attention** (Dao et al. 2022): standard self-attention materializes the T×T score matrix-at T=4096 that's 64 MB per GPU. Flash Attention computes softmax and the V-weighted sum in tiles, staying in SRAM throughout. Result: O(T) memory instead of O(T²), 2-4× faster on long sequences.
Dividing by √d_k is essential: for d_k=64, random unit vectors have dot products of magnitude ~√d_k ≈ 8. Without normalization, softmax saturates and gradients vanish. This follows from the fact that E[‖x‖²] = d for a standard Gaussian, so dot products grow with dimension.
Why does Flash Attention require O(T) memory instead of O(T²)?
einsum and Tensor Operations
torch.einsum and numpy.einsum express any tensor contraction via Einstein summation notation: repeated indices are summed. 'ij,jk->ik' is matrix multiply; 'bthd,bshd->bhts' computes multi-head attention scores. The notation is readable, supports autograd, and gets compiled to efficient kernels.
Contraction order matters: for A(m×k), B(k×n), C(n×p), cost of (AB)C = O(mkn + mnp) while A(BC) = O(knp + mkp)-which is cheaper depends on the dimensions. opt_einsum finds the optimal order automatically; torch.einsum accepts an explicit path argument.
What does torch.einsum('ij,ij->i', A, B) compute for A, B ∈ ℝ^{m×n}?
3D Transforms and the MVP Pipeline
In 3D graphics (OpenGL, Vulkan, WebGPU) vertices pass through: Object → World → Camera → Clip → NDC. Each stage is a 4×4 matrix multiplication. Homogeneous coordinates [x,y,z,w] let one encode translation as matrix multiply and perform perspective division.
**Quaternions vs matrices:** a 4×4 rotation matrix takes 16 floats; a quaternion takes 4. For character animation, SLERP (Spherical Linear Interpolation) on quaternions gives the shortest rotation path on the 3-sphere. In shader code, quaternions get converted to 4×4 matrices because GPU SIMD is optimized for matrix-vector multiply.
| Space | Matrix | Content |
|---|---|---|
| Object → World | Model M | Scale, rotation, translation of the object |
| World → Camera | View V | Inverse of the camera's world transform |
| Camera → Clip | Projection P | Perspective (w≠1) or orthographic (w=1) |
| Clip → NDC | Divide by w | Not a matrix-per-vertex GPU operation |
Why do 3D graphics pipelines use homogeneous coordinates [x, y, z, w]?
Key Ideas
- **Three factorizations:** QR (stable least squares, orthogonal init), SVD (PCA/LoRA/pseudoinverse), Cholesky (Gaussian sampling, VAE reparameterization trick)
- **Attention:** Attention(Q,K,V) = softmax(QKᵀ/√d_k)V; Flash Attention = tiling → O(T) memory; √d_k normalization prevents softmax saturation
- **einsum:** Einstein notation for tensor contractions; 'ij,jk->ik' = matmul; 'bhqd,bhkd->bhqk' = multi-head attention scores; 'ij,ij->i' = per-row dot products
- **3D pipeline:** MVP = P·V·M; homogeneous coordinates [x,y,z,w] encode translation and perspective divide; quaternions for interpolation, matrices for GPU shaders
Related Topics
Applied linear algebra ties the whole course together:
- SVD — PCA, LoRA, pseudoinverse-the factorization behind most ML compression and adaptation techniques
- Inner Products and Orthogonality — Attention is a weighted sum via dot products; QR orthogonal initialization preserves gradient norms
- Linear Algebra in Interviews — FAANG questions about attention shapes, LoRA rank, and matrix complexities test understanding of these concepts
Вопросы для размышления
- Flash Attention uses tiling to compute softmax in blocks. Why can't standard softmax be computed row-by-row independently, and what online algorithm (Milakov & Gimelshein, 2018) makes block-wise computation exact?
- In a transformer, Q, K, V are all linear projections of the same input X through different weight matrices. Why are three separate weight matrices (W_Q, W_K, W_V) critical for expressiveness rather than one shared projection?
- LoRA initializes A with Gaussian noise and B with zeros so that ΔW = AB = 0 at training start. Why is it important to begin with zero delta? What would happen if both A and B were randomly initialized?