Algebra

Linear Algebra in ML and Graphics

A transformer run involves 12+ matrix multiplications per token. Flash Attention makes GPT-4-scale models 3× faster through one algorithmic insight: tile the matrices to stay in fast SRAM. A WebGPU shader for a 3D scene multiplies 4×4 matrices for every vertex, 60 times per second. Linear algebra isn't abstract-it's running right now, at massive scale.

**Flash Attention:** O(T) memory instead of O(T²) via tiled softmax; 2-4× speedup on long contexts; shipped in torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+)
**LoRA + SVD:** fine-tune LLaMA-7B on one GPU in hours; rank=16 for 4096×4096 matrices = 131K trainable params instead of 16M; stored as ΔW = AB added on top of frozen weights
**3D Graphics:** WebGPU/Vulkan/OpenGL-all transforms are 4×4 matrix multiplies; character animation via quaternions + SLERP; NeRF and 3D Gaussian Splatting use the same MVP matrices for differentiable rendering

Предварительные знания

Singular Value Decomposition

Matrix Factorizations in ML

GPT-4 has 1.8 trillion parameters organized as matrices. QR decomposition initializes weights orthogonally. SVD powers LoRA - fine-tuning a 4096×4096 matrix with rank-16 reduces 16M trainable parameters to 131K. Cholesky factorization makes VAE reparameterization exact. These three factorizations are what make training at scale feasible.

Factorization	Requirement	Cost	ML Applications
QR: A = QR	Any A	O(mn²)	Stable least squares, orthogonal init, QR iteration
SVD: A = UΣVᵀ	Any A	O(mn·min(m,n))	PCA, LoRA, matrix factorization, pseudoinverse
Cholesky: Σ = LLᵀ	PSD matrix	O(n³/3)	Gaussian sampling, Kalman filter, GP regression

**Reparameterization trick** in VAEs: to backpropagate through sampling, write z = μ + Lε where ε ~ N(0,I) and L is the Cholesky factor of the covariance matrix. Gradients flow through μ and L automatically; the randomness is isolated in ε.

Which factorization is standard for sampling from a multivariate Gaussian N(μ, Σ) with full covariance?

Attention Mechanism as Linear Algebra

Self-attention is a weighted combination of value vectors: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V. Q, K, V ∈ ℝ^{T×d} are linear projections of input X through learned weight matrices W_Q, W_K, W_V. The entire mechanism is matrix multiplications plus a softmax nonlinearity.

**Flash Attention** (Dao et al. 2022): standard self-attention materializes the T×T score matrix-at T=4096 that's 64 MB per GPU. Flash Attention computes softmax and the V-weighted sum in tiles, staying in SRAM throughout. Result: O(T) memory instead of O(T²), 2-4× faster on long sequences.

Dividing by √d_k is essential: for d_k=64, random unit vectors have dot products of magnitude ~√d_k ≈ 8. Without normalization, softmax saturates and gradients vanish. This follows from the fact that E[‖x‖²] = d for a standard Gaussian, so dot products grow with dimension.

Why does Flash Attention require O(T) memory instead of O(T²)?

einsum and Tensor Operations

torch.einsum and numpy.einsum express any tensor contraction via Einstein summation notation: repeated indices are summed. 'ij,jk->ik' is matrix multiply; 'bthd,bshd->bhts' computes multi-head attention scores. The notation is readable, supports autograd, and gets compiled to efficient kernels.

Contraction order matters: for A(m×k), B(k×n), C(n×p), cost of (AB)C = O(mkn + mnp) while A(BC) = O(knp + mkp)-which is cheaper depends on the dimensions. opt_einsum finds the optimal order automatically; torch.einsum accepts an explicit path argument.

What does torch.einsum('ij,ij->i', A, B) compute for A, B ∈ ℝ^{m×n}?

3D Transforms and the MVP Pipeline

In 3D graphics (OpenGL, Vulkan, WebGPU) vertices pass through: Object → World → Camera → Clip → NDC. Each stage is a 4×4 matrix multiplication. Homogeneous coordinates [x,y,z,w] let one encode translation as matrix multiply and perform perspective division.

**Quaternions vs matrices:** a 4×4 rotation matrix takes 16 floats; a quaternion takes 4. For character animation, SLERP (Spherical Linear Interpolation) on quaternions gives the shortest rotation path on the 3-sphere. In shader code, quaternions get converted to 4×4 matrices because GPU SIMD is optimized for matrix-vector multiply.

Space	Matrix	Content
Object → World	Model M	Scale, rotation, translation of the object
World → Camera	View V	Inverse of the camera's world transform
Camera → Clip	Projection P	Perspective (w≠1) or orthographic (w=1)
Clip → NDC	Divide by w	Not a matrix-per-vertex GPU operation

Why do 3D graphics pipelines use homogeneous coordinates [x, y, z, w]?

Key Ideas

**Three factorizations:** QR (stable least squares, orthogonal init), SVD (PCA/LoRA/pseudoinverse), Cholesky (Gaussian sampling, VAE reparameterization trick)
**Attention:** Attention(Q,K,V) = softmax(QKᵀ/√d_k)V; Flash Attention = tiling → O(T) memory; √d_k normalization prevents softmax saturation
**einsum:** Einstein notation for tensor contractions; 'ij,jk->ik' = matmul; 'bhqd,bhkd->bhqk' = multi-head attention scores; 'ij,ij->i' = per-row dot products
**3D pipeline:** MVP = P·V·M; homogeneous coordinates [x,y,z,w] encode translation and perspective divide; quaternions for interpolation, matrices for GPU shaders

Вопросы для размышления

Flash Attention uses tiling to compute softmax in blocks. Why can't standard softmax be computed row-by-row independently, and what online algorithm (Milakov & Gimelshein, 2018) makes block-wise computation exact?
In a transformer, Q, K, V are all linear projections of the same input X through different weight matrices. Why are three separate weight matrices (W_Q, W_K, W_V) critical for expressiveness rather than one shared projection?
LoRA initializes A with Gaussian noise and B with zeros so that ΔW = AB = 0 at training start. Why is it important to begin with zero delta? What would happen if both A and B were randomly initialized?

Связанные уроки

la-01-vectors-intro

Algebra

Linear Algebra in ML and Graphics

**Flash Attention:** O(T) memory instead of O(T²) via tiled softmax; 2-4× speedup on long contexts; shipped in torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+)
**LoRA + SVD:** fine-tune LLaMA-7B on one GPU in hours; rank=16 for 4096×4096 matrices = 131K trainable params instead of 16M; stored as ΔW = AB added on top of frozen weights
**3D Graphics:** WebGPU/Vulkan/OpenGL-all transforms are 4×4 matrix multiplies; character animation via quaternions + SLERP; NeRF and 3D Gaussian Splatting use the same MVP matrices for differentiable rendering

Предварительные знания

Singular Value Decomposition

Matrix Factorizations in ML

Factorization	Requirement	Cost	ML Applications
QR: A = QR	Any A	O(mn²)	Stable least squares, orthogonal init, QR iteration
SVD: A = UΣVᵀ	Any A	O(mn·min(m,n))	PCA, LoRA, matrix factorization, pseudoinverse
Cholesky: Σ = LLᵀ	PSD matrix	O(n³/3)	Gaussian sampling, Kalman filter, GP regression

Which factorization is standard for sampling from a multivariate Gaussian N(μ, Σ) with full covariance?

Attention Mechanism as Linear Algebra

Why does Flash Attention require O(T) memory instead of O(T²)?

einsum and Tensor Operations

What does torch.einsum('ij,ij->i', A, B) compute for A, B ∈ ℝ^{m×n}?

3D Transforms and the MVP Pipeline

Space	Matrix	Content
Object → World	Model M	Scale, rotation, translation of the object
World → Camera	View V	Inverse of the camera's world transform
Camera → Clip	Projection P	Perspective (w≠1) or orthographic (w=1)
Clip → NDC	Divide by w	Not a matrix-per-vertex GPU operation

Why do 3D graphics pipelines use homogeneous coordinates [x, y, z, w]?

Key Ideas

**Three factorizations:** QR (stable least squares, orthogonal init), SVD (PCA/LoRA/pseudoinverse), Cholesky (Gaussian sampling, VAE reparameterization trick)
**Attention:** Attention(Q,K,V) = softmax(QKᵀ/√d_k)V; Flash Attention = tiling → O(T) memory; √d_k normalization prevents softmax saturation
**einsum:** Einstein notation for tensor contractions; 'ij,jk->ik' = matmul; 'bhqd,bhkd->bhqk' = multi-head attention scores; 'ij,ij->i' = per-row dot products
**3D pipeline:** MVP = P·V·M; homogeneous coordinates [x,y,z,w] encode translation and perspective divide; quaternions for interpolation, matrices for GPU shaders

Вопросы для размышления

Flash Attention uses tiling to compute softmax in blocks. Why can't standard softmax be computed row-by-row independently, and what online algorithm (Milakov & Gimelshein, 2018) makes block-wise computation exact?
In a transformer, Q, K, V are all linear projections of the same input X through different weight matrices. Why are three separate weight matrices (W_Q, W_K, W_V) critical for expressiveness rather than one shared projection?
LoRA initializes A with Gaussian noise and B with zeros so that ΔW = AB = 0 at training start. Why is it important to begin with zero delta? What would happen if both A and B were randomly initialized?

Связанные уроки

la-01-vectors-intro

Linear Algebra in ML and Graphics

Предварительные знания

Matrix Factorizations in ML

Attention Mechanism as Linear Algebra

einsum and Tensor Operations

3D Transforms and the MVP Pipeline

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Linear Algebra in ML and Graphics

Предварительные знания

Matrix Factorizations in ML

Attention Mechanism as Linear Algebra

einsum and Tensor Operations

3D Transforms and the MVP Pipeline

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки