Linear Algebra
Linear Algebra in ML: the Synthesis
GPT-4 performs over 1.8 trillion matrix multiplications in a single forward pass. Transformer attention, embedding lookup, PCA for data analysis, PageRank for search - each is a direct application of linear algebra concepts from previous lessons, now running at scale.
- Transformer: Attention(Q,K,V) = softmax(QKᵀ/√d)V - three matrix multiplies per head
- Embeddings: token → row of matrix E of size vocab × d_model
- PCA: centering + SVD; 30 components explain 95% of face variation in the AT&T dataset
- PageRank: eigenvector of a 50B-page matrix found via power iteration in ~50 steps
- GPU GEMM: all ML training reduces to C = A·B on Tensor Cores
Neural networks and attention
**A 4-layer neural network is exactly 4 matrix multiplications with nonlinearities in between.** Attention in a transformer is three matrices (Q, K, V), a dot product, and a softmax. PCA is SVD with one function call. Embeddings are rows of a weight matrix. All of modern ML is one big linear algebra course running on a GPU. This lesson introduces no new concepts - it is the final assembly: where exactly each idea from this course lives in real systems.
**Goal of this lesson**: not a recap of applications but an understanding of structure. After this, any ML paper with formulas reads like familiar text rather than a foreign language.
Neural Networks: a Chain of Linear Layers
Every linear layer in a neural network is the multiplication of the input vector by a weight matrix. The nonlinearity (ReLU, GELU) is inserted between layers - without it, a product of any number of matrices is still just one matrix, and all the depth of the network would be useless.
ONE LAYER: h = activation(W * x + b) x - input vector (n,) W - weight matrix (m, n) b - bias vector (m,) h - output vector (m,) FULL NETWORK (4-layer MNIST): x0 = 28x28 image -> vector (784,) h1 = ReLU(W1 * x0 + b1) W1 in R^(256x784) 200K parameters h2 = ReLU(W2 * h1 + b2) W2 in R^(128x256) 33K parameters h3 = ReLU(W3 * h2 + b3) W3 in R^(64x128) 8K parameters y = softmax(W4 * h3 + b4) W4 in R^(10x64) 640 parameters Total: ~242K parameters, 99%+ live inside the W matrices. BACKPROP: dL/dW = dL/dh * xt (outer product) dL/dx = Wt * dL/dh (transposition!) Transposition from the earlier lesson - this is where it lives.
**Why nonlinearities are necessary**: a product of two linear maps is still a linear map. W3*W2*W1 collapses into a single matrix. Without ReLU/GELU all the "depth" of a network is an illusion - one effective layer. Nonlinearity breaks that equality and makes depth genuinely matter.
Attention: Matrix Products at the Heart of the Transformer
GPT, BERT, LLaMA, Claude - all run on the attention mechanism. Under the hood: three weight matrices, two matrix multiplications, and a softmax. Literally a few lines of linear algebra.
GIVEN: sequence of token vectors X in R^(n x d) STEP 1 - project into Q, K, V: Q = X * Wq (n x d) * (d x dk) = (n x dk) 'Queries' K = X * Wk (n x d) * (d x dk) = (n x dk) 'Keys' V = X * Wv (n x d) * (d x dv) = (n x dv) 'Values' STEP 2 - compute attention weights: scores = Q * Kt / sqrt(dk) (n x n) - compatibility matrix A = softmax(scores) (n x n) - normalized weights STEP 3 - aggregate values: output = A * V (n x dv) - contextual representations FULL FORMULA: Attention(Q,K,V) = softmax(Q*Kt / sqrt(dk)) * V Q*Kt - dot products of all token pairs! Every token looks at every other token through a dot product.
**Q*Kt is a matrix of dot products**: element [i, j] measures how well query i matches key j. That is exactly cosine similarity (without normalization), discussed in the dot product lesson. Transformers are literally built on dot products.
Why do neural networks need non-linearities between linear layers?
W_2 (W_1 x) = (W_2 W_1) x - any number of linear layers without an activation is mathematically a single linear layer with weight matrix W_2 W_1. The expressive power comes from non-linearities (ReLU, GELU) placed between linear layers.
Embeddings, PCA, PageRank
Embeddings: a Weight Matrix as a Lookup Table
An embedding layer is simply a large matrix E of shape (vocab_size x d_model). Each token corresponds to one row. "Looking up an embedding" is multiplying a one-hot vector by the matrix - equivalent to indexing a row.
EMBEDDING MATRIX: E in R^(vocab x d) - for example, 50 000 x 768 (BERT-base) Parameters: 50 000 x 768 = 38 400 000 That is ~40% of all BERT parameters! LOOKING UP TOKEN i: one_hot in R^(vocab,) - 1 at position i, 0 everywhere else emb_i = one_hot * E = i-th row of E (matrix multiplication!) COSINE SIMILARITY BETWEEN TOKENS: sim(i, j) = E[i] * E[j] / (||E[i]|| * ||E[j]||) - 'cat' and 'kitten' -> high similarity - 'cat' and 'excavator' -> low similarity GPT-2 WEIGHT TYING: In GPT-2 the embedding matrix = output projection matrix (transposed) Savings: 50 257 x 768 = 38.5M parameters used twice
PCA: the Full Pipeline in 10 Lines
**PCA (Principal Component Analysis)** is the primary tool for dimensionality reduction. Under the hood: centering the data and SVD. Mathematically: find an orthogonal basis where data variance is maximized along the first axes.
GIVEN: X in R^(n x d) - n objects, d features 1. CENTERING: X_c = X - mean(X, axis=0) subtract the mean of each feature 2. SVD: X_c = U * Sigma * Vt 3. PRINCIPAL COMPONENTS: V = right singular vectors = directions of maximum variance Sigma^2 / (n-1) = variance along each component 4. PROJECT onto k components: X_reduced = X_c * V[:, :k] (n x k) 5. RECONSTRUCT (approximate): X_approx = X_reduced * V[:, :k]t + mean(X, axis=0) EXPLAINED VARIANCE RATIO: ratio_i = si^2 / sum(sj^2) Standard practice: choose k giving >= 95% cumulative variance
**Eigenfaces** - the classic PCA application: the matrix of face images (n_images x n_pixels), SVD yields "principal faces" - linear combinations explaining maximum variability. Face recognition in the AT&T Database ran on 40 eigenfaces. Modern methods replaced linear PCA with nonlinear autoencoders, but the idea is identical.
PageRank: Google Search as an Eigenvector Problem
PageRank by Larry Page and Sergey Brin (1998) finds the eigenvector of the web-graph transition matrix. A page is important if important pages link to it - circular logic that resolves into an eigenvector with lambda = 1.
TRANSITION MATRIX M: M[i,j] = 1/out_degree(j) if j links to i M[i,j] = 0 otherwise PAGERANK VECTOR r: M * r = r (eigenvector with lambda = 1) IN PRACTICE a damping factor d ~= 0.85 is added: M' = d * M + (1-d)/n * 1*1t Meaning: with probability 0.15 the user jumps to a random page POWER ITERATION: r0 = (1/n, 1/n, ..., 1/n) uniform start r(t+1) = M' * r(t) -> converges to the Perron-Frobenius eigenvector CONVERGENCE: |r(t+1) - r*| <= d * |r(t) - r*| 50-100 iterations give sufficient accuracy for web-scale search
**Spectral clustering** is a related method: build a similarity graph between objects, compute eigenvectors of the graph Laplacian matrix, and use them for clustering. Applied in image segmentation, social network analysis, and bioinformatics. Same mathematics, different application.
What linear algebra problem does PageRank solve?
PageRank (1998) defines a random surfer model where the transition matrix P_ij is the probability of going from page j to page i. The PageRank vector r is the dominant eigenvector: Pr = r with sum r_i = 1. Computed by power iteration: r_{k+1} = P r_k.
GPU matrix products and 3D graphics
Matrix Operations on the GPU: Why It Is So Fast
Training GPT-4 took ~100 days on thousands of A100s. Not because neural networks are "complex" - because **GEMM (General Matrix Multiply)** on a GPU runs thousands of times faster than on a CPU. All ML performance comes down to one operation.
REALISTIC SIZES (GPT-2 medium training): Batch size = 32, seq_len = 1024, d_model = 1024 One attention Q projection: X * Wq: (32*1024, 1024) * (1024, 1024) = (32768, 1024) FLOPS: 2 * 32768 * 1024 * 1024 ~= 69 GFLOPS Full forward pass one transformer block: ~550 GFLOPS 16 blocks of GPT-2: ~8.8 TFLOPS - exactly what an A100 delivers PARALLELISM: C[i,j] = sum_k A[i,k] * B[k,j] - each element is independent! GPU: 6912 CUDA cores compute different C[i,j] simultaneously Tensor Cores (A100): 312 TFLOPS for FP16 matrix multiply UNDER THE HOOD: cuBLAS (NVIDIA), rocBLAS (AMD), XLA (Google TPU) PyTorch calls them automatically for @, matmul, einsum
**Batch processing**: instead of multiplying W * x one vector at a time, multiply W * X where X holds an entire batch. Same GEMM, just larger - and GPUs love that. That is why batch size affects speed so strongly: a small batch leaves the GPU underutilized.
3D Graphics: MVP as a Matrix Product Chain
Every frame in a game or neural renderer puts all vertices through a chain of matrix transformations. On the GPU this runs in parallel for millions of vertices.
PIPELINE: v_clip = P * V * M * v_local M (Model matrix): local coords -> world coords (scale, rotation, translation) V (View matrix): world coords -> camera coords (inverse camera transform) P (Projection matrix): camera coords -> clip space (perspective divide) OPTIMIZATION: M, V, P do not change within a single draw call -> precompute MVP = P * V * M once on CPU -> on GPU: v_clip = MVP * v_local Savings: 3 matrix multiplications -> 1 FOR 60 FPS, 1M VERTICES: 60 * 1 000 000 multiplications of 4x4 by 4-vector = 1.44 GFLOPS GPU handles this in milliseconds
**Linear Algebra in the Modern ML Stack** Every layer of the stack relies on matrix operations - **Neural network (any)** (y = activation(Wx + b) at every layer): Linear layer = matrix multiplication. PyTorch nn.Linear = W @ x + b - **Transformer / Attention** (Attention(Q,K,V) = softmax(QKt/sqrt(d))V): Q*Kt - matrix of dot products. GPT, BERT, LLaMA, Claude - all here - **Embeddings** (Token -> row of matrix E): word2vec, GloVe, GPT BPE tokens. Cosine similarity = dot product of normalized rows - **PCA / SVD** (Dimensionality reduction, denoising): sklearn.PCA, TruncatedSVD, LoRA fine-tuning, Eigenfaces, LSA - **Recommender systems** (R ~= UVt (matrix factorization)): Two-tower models in YouTube/TikTok. SVD++ in Spotify. User-item dot product - **GPU / Hardware** (GEMM - General Matrix Multiply): cuBLAS, Tensor Cores, XLA TPU. All ML performance reduces to one matrix multiply speed
Practice: covariance matrix and variance
**Interview-style review questions:** - **Q:** Why does a product of any number of linear layers without nonlinearities collapse into a single linear layer? - Key points: A linear map f(x) = Wx preserves addition and scalar multiplication; Composition of two linear maps is still linear: W2(W1*x) = (W2*W1)*x; W3*W2*W1 is one matrix. Three layers without nonlinearities = one effective layer; ReLU/GELU break linearity and make network depth genuinely necessary - **Q:** How does the attention mechanism work from a linear algebra perspective? - Key points: Q, K, V are linear projections of input tokens (multiplied by learned matrices Wq, Wk, Wv); Q*Kt is a matrix of pairwise dot products: how well each query matches each key; Softmax normalizes the weights; A*V is the weighted sum of values by those weights; Multi-head attention runs several attention heads in parallel with different projection matrices - **Q:** Full PCA pipeline: from raw data to two principal components - what linear algebra steps are needed? - Key points: 1. Centering: X_c = X - mean(X, axis=0) - subtract the mean of each feature; 2. SVD: X_c = U*Sigma*Vt - right singular vectors V = directions of max variance; 3. Projection: X_2d = X_c * V[:, :2] - multiply by first two columns of V; Explained variance = si^2 / sum(sj^2) - used to choose the number of components
Why is GPU matrix multiplication 100x faster than CPU for neural networks?
A matrix product (m×k) * (k×n) is m*n independent inner products of length k. The GPU runs all m*n inner products in parallel across thousands of cores. The cuBLAS GEMM kernel reaches >90% of theoretical FLOPs on A100. CPUs are limited to 8-32 cores.
Final Synthesis of the Course
- **Neural network = chain of Wx + b**: every layer is a matrix; backprop = transposition and outer products
- **Attention = QKt + softmax + V**: dot products between token pairs determine who attends to whom
- **Embeddings = rows of matrix E**: cosine similarity = dot product of normalized rows
- **PCA = SVD**: center the data, take the right singular vectors - those are the principal components
- **PageRank = eigenvector**: the page ranking converges via power iteration to the Perron-Frobenius vector
- **GPU = fast GEMM**: all ML speedup reduces to one operation - matrix multiplication on Tensor Cores
Where These Ideas Came From
Each application is the payoff of one of the course lessons
- SVD — LoRA, LSA, PCA, pseudoinverse - all from singular value decomposition
- Dot Product — Cosine similarity, attention scores - the foundation of search and transformers
- Eigenvectors — PageRank, PCA, spectral clustering - three different applications of one idea