Algebra

Linear Algebra in ML and Graphics

A transformer run involves 12+ matrix multiplications per token. Flash Attention makes GPT-4-scale models 3× faster through one algorithmic insight: tile the matrices to stay in fast SRAM. A WebGPU shader for a 3D scene multiplies 4×4 matrices for every vertex, 60 times per second. Linear algebra isn't abstract-it's running right now, at massive scale.

  • **Flash Attention:** O(T) memory instead of O(T²) via tiled softmax; 2-4× speedup on long contexts; shipped in torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+)
  • **LoRA + SVD:** fine-tune LLaMA-7B on one GPU in hours; rank=16 for 4096×4096 matrices = 131K trainable params instead of 16M; stored as ΔW = AB added on top of frozen weights
  • **3D Graphics:** WebGPU/Vulkan/OpenGL-all transforms are 4×4 matrix multiplies; character animation via quaternions + SLERP; NeRF and 3D Gaussian Splatting use the same MVP matrices for differentiable rendering

Предварительные знания

  • Singular Value Decomposition

Matrix Factorizations in ML

GPT-4 has 1.8 trillion parameters organized as matrices. QR decomposition initializes weights orthogonally. SVD powers LoRA - fine-tuning a 4096×4096 matrix with rank-16 reduces 16M trainable parameters to 131K. Cholesky factorization makes VAE reparameterization exact. These three factorizations are what make training at scale feasible.

FactorizationRequirementCostML Applications
QR: A = QRAny AO(mn²)Stable least squares, orthogonal init, QR iteration
SVD: A = UΣVᵀAny AO(mn·min(m,n))PCA, LoRA, matrix factorization, pseudoinverse
Cholesky: Σ = LLᵀPSD matrixO(n³/3)Gaussian sampling, Kalman filter, GP regression

**Reparameterization trick** in VAEs: to backpropagate through sampling, write z = μ + Lε where ε ~ N(0,I) and L is the Cholesky factor of the covariance matrix. Gradients flow through μ and L automatically; the randomness is isolated in ε.

Which factorization is standard for sampling from a multivariate Gaussian N(μ, Σ) with full covariance?

Attention Mechanism as Linear Algebra

Self-attention is a weighted combination of value vectors: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. Q, K, V ∈ ℝ^{T×d} are linear projections of input X through learned weight matrices W_Q, W_K, W_V. The entire mechanism is matrix multiplications plus a softmax nonlinearity.

**Flash Attention** (Dao et al. 2022): standard self-attention materializes the T×T score matrix-at T=4096 that's 64 MB per GPU. Flash Attention computes softmax and the V-weighted sum in tiles, staying in SRAM throughout. Result: O(T) memory instead of O(T²), 2-4× faster on long sequences.

Dividing by √d_k is essential: for d_k=64, random unit vectors have dot products of magnitude ~√d_k ≈ 8. Without normalization, softmax saturates and gradients vanish. This follows from the fact that E[‖x‖²] = d for a standard Gaussian, so dot products grow with dimension.

Why does Flash Attention require O(T) memory instead of O(T²)?

einsum and Tensor Operations

torch.einsum and numpy.einsum express any tensor contraction via Einstein summation notation: repeated indices are summed. 'ij,jk->ik' is matrix multiply; 'bthd,bshd->bhts' computes multi-head attention scores. The notation is readable, supports autograd, and gets compiled to efficient kernels.

Contraction order matters: for A(m×k), B(k×n), C(n×p), cost of (AB)C = O(mkn + mnp) while A(BC) = O(knp + mkp)-which is cheaper depends on the dimensions. opt_einsum finds the optimal order automatically; torch.einsum accepts an explicit path argument.

What does torch.einsum('ij,ij->i', A, B) compute for A, B ∈ ℝ^{m×n}?

3D Transforms and the MVP Pipeline

In 3D graphics (OpenGL, Vulkan, WebGPU) vertices pass through: Object → World → Camera → Clip → NDC. Each stage is a 4×4 matrix multiplication. Homogeneous coordinates [x,y,z,w] let one encode translation as matrix multiply and perform perspective division.

**Quaternions vs matrices:** a 4×4 rotation matrix takes 16 floats; a quaternion takes 4. For character animation, SLERP (Spherical Linear Interpolation) on quaternions gives the shortest rotation path on the 3-sphere. In shader code, quaternions get converted to 4×4 matrices because GPU SIMD is optimized for matrix-vector multiply.

SpaceMatrixContent
Object → WorldModel MScale, rotation, translation of the object
World → CameraView VInverse of the camera's world transform
Camera → ClipProjection PPerspective (w≠1) or orthographic (w=1)
Clip → NDCDivide by wNot a matrix-per-vertex GPU operation

Why do 3D graphics pipelines use homogeneous coordinates [x, y, z, w]?

Key Ideas

  • **Three factorizations:** QR (stable least squares, orthogonal init), SVD (PCA/LoRA/pseudoinverse), Cholesky (Gaussian sampling, VAE reparameterization trick)
  • **Attention:** Attention(Q,K,V) = softmax(QKᵀ/√d_k)V; Flash Attention = tiling → O(T) memory; √d_k normalization prevents softmax saturation
  • **einsum:** Einstein notation for tensor contractions; 'ij,jk->ik' = matmul; 'bhqd,bhkd->bhqk' = multi-head attention scores; 'ij,ij->i' = per-row dot products
  • **3D pipeline:** MVP = P·V·M; homogeneous coordinates [x,y,z,w] encode translation and perspective divide; quaternions for interpolation, matrices for GPU shaders

Related Topics

Applied linear algebra ties the whole course together:

  • SVD — PCA, LoRA, pseudoinverse-the factorization behind most ML compression and adaptation techniques
  • Inner Products and Orthogonality — Attention is a weighted sum via dot products; QR orthogonal initialization preserves gradient norms
  • Linear Algebra in Interviews — FAANG questions about attention shapes, LoRA rank, and matrix complexities test understanding of these concepts

Вопросы для размышления

  • Flash Attention uses tiling to compute softmax in blocks. Why can't standard softmax be computed row-by-row independently, and what online algorithm (Milakov & Gimelshein, 2018) makes block-wise computation exact?
  • In a transformer, Q, K, V are all linear projections of the same input X through different weight matrices. Why are three separate weight matrices (W_Q, W_K, W_V) critical for expressiveness rather than one shared projection?
  • LoRA initializes A with Gaussian noise and B with zeros so that ΔW = AB = 0 at training start. Why is it important to begin with zero delta? What would happen if both A and B were randomly initialized?

Связанные уроки

  • la-01-vectors-intro
Linear Algebra in ML and Graphics

0

1

Sign In