Linear Algebra

Linear Algebra in Deep Learning

GPT-4 performs 1.8 trillion matrix multiplications per forward pass - not metaphorically, literally. The rank of the weight matrices, their spectral norm, their condition number are not abstract notions but the characteristics that govern training speed, stability, and generalization.

  • LoRA: the low-rank hypothesis dW = BA lets you fine-tune GPT-3 on one GPU instead of eight
  • Flash Attention: tiled processing of QK^T - O(n) memory instead of O(n^2), context of 128K tokens
  • Spectral Normalization: W/sigma_1(W) stabilizes GAN training (Miyato et al. 2018)
  • Batch Normalization: centering and rescaling activations - a diagonal matrix operation
  • Gradient clipping: capping the gradient norm - standard for training Transformers

**GPT-4 performs 1.8 trillion matrix multiplications per forward pass.** Not 'applies neural operations' - matrix multiplications. Each token traverses hundreds of layers of y = Wx + b, every attention head computes QK^T / sqrt(d), every backward pass uses transposition and the chain rule. Linear algebra is not the description language of neural networks. It is what they are when you look inside.

**What this lesson is really about**: not that neural networks use matrices - that part is obvious. The point is that the specific properties of matrices (rank, norm, condition number, transpose) decide how a model trains, generalizes, and scales. After this lesson GPT, LoRA, Flash Attention stop being black boxes.

Forward pass: matrix multiplications everywhere

A linear layer of a neural network is exactly a matrix-vector multiplication: **y = Wx + b**, with W of shape (out, in). GPT-2 has an FFN block 768 -> 3072 -> 768: W1 in R^(3072 x 768), W2 in R^(768 x 3072). Processing a batch of 32 sequences of length 1024 means multiplying matrices of shape (32 x 1024) x 768 - a GEMM kernel on the GPU.

Backprop: transpose as the engine of the chain rule

Backpropagation is the chain rule expressed via the transpose. If y = Wx, the gradient w.r.t. the weights is **dL/dW = (dL/dy) * x^T**, and the gradient w.r.t. the input is **dL/dx = W^T * (dL/dy)**. The transpose is not accidental: it is the ~adjoint map~{adjoint map} from linear algebra. Dimensions check out automatically - if W^T does not fit, there is a bug somewhere in the chain.

**Gradient clipping** is just rescaling the gradient vector. PyTorch does `torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)`, equivalent to: if ||g|| > 1.0, set g = g / ||g||. A pure linear-algebra operation - scaling a vector to unit norm.

Attention: three matmuls per head

Scaled dot-product attention is one formula with three matrix multiplications. **QK^T** gives the similarity matrix (seq x seq), the divide by sqrt(d_k) prevents softmax saturation, the multiply by **V** weights the values. Complexity O(seq^2 * d_k) - the quadratic dependence on sequence length is exactly what motivated Flash Attention, which avoids materializing that matrix.

Batch Normalization: the centering matrix

Batch Normalization looks like a simple normalization, but in linear-algebra terms it is a projection onto a hyperplane. Subtracting the mean applies the matrix **(I - (1/m)11^T)** - the centering projection that removes the component along the all-ones vector. Dividing by the standard deviation is a diagonal scaling. Learnable gamma, beta let the network undo the normalization if the task benefits.

LoRA: fine-tuning via low-rank matrices

LoRA (Low-Rank Adaptation) is built on one observation: **the weight changes during fine-tuning of an LLM are low rank**. Instead of updating the full matrix W with d^2 parameters, train two small matrices B and A such that W_new = W0 + BA. For GPT-3 with d=12288 and r=4: instead of 150M parameters, 98K. This is rank decomposition in action, the same idea as in SVD.

**Why B starts at zero**: at the start LoRA should be an identity supplement that does not change the base model's behavior. B=0 guarantees BA = 0, so W_new = W0. If instead A=0, no gradient would flow through B.

Weight initialization: Xavier and He through variance

Initialize weights too small and the signal dies. Initialize them too large and it explodes. The key question: what Var(W) keeps the variance of activations constant across layers? **Xavier** (Glorot, 2010) derives Var(w) = 2/(n_in + n_out) for tanh and sigmoid. **He** (Kaiming, 2015) derives Var(w) = 2/n_in for ReLU: the factor 2 compensates for ReLU killing ~50 percent of neurons.

Practice: forward pass of a neural network

Rank, norm, and condition number of neural-network weight matrices

A weight matrix $W$ in a neural network is not just a bag of numbers. Its **rank** sets the layer's information capacity. Its **spectral norm** $\|W\|_2 = \sigma_1$ (largest singular value) is the layer's Lipschitz constant: how much it can stretch an input vector. The **condition number** $\kappa(W) = \sigma_1/\sigma_{\min}$ shapes convergence speed.

**LoRA (Low-Rank Adaptation)**: instead of updating the full $W \in \mathbb{R}^{m \times n}$, parameterize the update as $\Delta W = BA$, with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, $r \ll \min(m,n)$. Parameters: $(m+n)r$ instead of $mn$. For GPT-3 ($m=n=12288$, $r=8$): around 0.01 percent of the original parameters. Hypothesis: fine-tuning changes have low rank.

Spectral Normalization for stable GAN training

Controlling the Lipschitz constant

GANs are notoriously unstable: the discriminator can change its predictions abruptly. Spectral Normalization (Miyato et al., 2018): divide every weight matrix by its spectral norm $W \leftarrow W / \sigma_1(W)$. This guarantees $\|W\|_2 = 1$ and a Lipschitz constant of at most 1 per layer. $\sigma_1$ is computed by 1-2 iterations of the power method.

Why does LoRA use the decomposition $\Delta W = BA$ with $r \ll n$ instead of updating $W$ directly?

When fine-tuning an LLM, $\Delta W$ is empirically close to a low-rank matrix. LoRA exploits this: $\text{rank}(BA) \leq r$. At $m=n=4096$, $r=8$: $16M$ vs $66K$ parameters. That makes fine-tuning GPT-3 on a single GPU possible.

Batch Normalization and Flash Attention through the matrix lens

Batch Normalization normalizes activations across a batch: $\hat{x}_i = (x_i - \mu)/\sigma$, then $y_i = \gamma \hat{x}_i + \beta$. In matrix form: $\hat{X} = (X - \mathbf{1}\mu^\top) \text{diag}(1/\sigma)$. The parameters $\gamma, \beta$ are diagonal matrices learned through backprop.

**Flash Attention** (Dao et al., 2022): standard attention computes $S = QK^\top$ - the full $n \times n$ matrix in GPU HBM ($O(n^2)$ memory). Flash Attention tiles the same computation using $O(n)$ GPU SRAM: it splits $Q,K,V$ into blocks and uses an online softmax. The result: 2-4x speedup, $O(n)$ memory instead of $O(n^2)$. That is why GPT-4 handles a 128K-token context.

Condition number and convergence speed

Why Adam beats SGD

For a loss $L(w) = \frac{1}{2}w^\top H w$ (quadratic approximation), SGD requires $O(\kappa(H) \log(1/\varepsilon))$ iterations, where $\kappa = \lambda_{\max}/\lambda_{\min}$. Adam adapts the step per parameter, effectively approximating $H^{-1}$ and reducing effective $\kappa$. A bad condition number of the Hessian is one reason plain SGD trains Transformers slowly.

Flash Attention computes the same result as standard attention, only faster. What is the key difference?

Standard attention: compute $S = QK^\top/\sqrt{d}$ ($O(n^2)$ entries), apply softmax to $S$, multiply by $V$. Flash Attention: split into blocks, update running max and sum for softmax online. Mathematically identical, but $O(n)$ HBM memory instead of $O(n^2)$.

Let L be a scalar, y = Wx, W in R^(m x n), x in R^n dL/dW = dL/dy * x^T in R^(m x n) [outer product] dL/dx = W^T * dL/dy in R^n [W^T undoes the action of W] Dimension check: dL/dy in R^m, x in R^n -> dL/dy * x^T in R^(m x n) = shape of W OK W^T in R^(n x m), dL/dy in R^m -> W^T * dL/dy in R^n = shape of x OK

Q, K, V in R^(seq x d_k) Step 1: S = QK^T / sqrt(d_k) in R^(seq x seq) O(seq^2 * d_k) FLOPS Step 2: A = softmax(S) in R^(seq x seq) O(seq^2) FLOPS Step 3: Out = A * V in R^(seq x d_k) O(seq^2 * d_k) FLOPS GPT-4 (estimated): seq=32768, d_k=128, 128 heads Memory for S: 32768^2 * 4 bytes = 4 GB per layer(!) -> Flash Attention: does not materialize S, blocks through SRAM

Weight matrix: W in R^(d x d), parameters: d^2 LoRA: W_new = W0 + B*A, B in R^(d x r), A in R^(r x d) LoRA parameters: d*r + r*d = 2dr GPT-3, d=12288, r=4: Full: 12288^2 = 150,994,944 LoRA: 2 x 12288 x 4 = 98,304 Savings: 1537x fewer parameters at comparable quality Why it works: the intrinsic dimensionality hypothesis - the space of useful weight updates is low-dimensional

  • **Linear / FFN layer**: matrix multiplication
  • **Self-Attention**: QK^T V via three matmuls
  • **Batch / Layer Normalization**: projection plus scaling
  • **Backpropagation**: transpose plus chain rule
  • **LoRA fine-tuning**: low-rank decomposition

Упражнения

  1. Why does LoRA work - why are the weight updates during fine-tuning low rank? — Intrinsic dimensionality hypothesis: most useful updates live in a low-dimensional subspace; Empirically: the rank of dW during GPT-3 fine-tuning on a specific task is around 4-16; LoRA can struggle when the task is far from the pretraining distribution - the rank is then higher; Tie to SVD: small singular values of W correspond to directions the model does not use
  2. Why are attention scores divided by sqrt(d_k) rather than d_k? — q*k = sum(q_i * k_i); if q_i, k_i ~ N(0,1) then Var(q*k) = d_k; For large d_k, std(scores) = sqrt(d_k): softmax saturates, one token gets weight ~1; Dividing by sqrt(d_k) brings Var(scores) back to 1 - softmax works well; At d_k = 64 without normalization std = 8, softmax becomes nearly one-hot, gradients vanish

A neural network through linear algebra

  • Forward pass is a chain of matmuls: y = Wx + b; GEMM accounts for over 90 percent of inference time
  • Backprop is transposition: dL/dW = delta * x^T, dL/dx = W^T * delta; dimensions self-check
  • Attention = QK^T/sqrt(d) + softmax + V; O(seq^2) memory; Flash Attention skips the materialization
  • BatchNorm = projection (I - 11^T/m) plus diagonal scaling; learnable gamma, beta
  • LoRA: dW = BA, r << d; 100-1500x fewer parameters; works because updates have low intrinsic dim
  • He init: Var(w) = 2/n_in for ReLU; the factor 2 compensates for the zeroed-out neurons
  • Gradient clipping: g = g * c/||g|| - vector rescaling, not coordinate clipping

Related topics

Spectral graph theory applies the same tools to structured data. GNNs combine matmul and the Laplacian in one forward pass.

  • Spectral graph theory — GNNs use the normalized Laplacian as the attention analog for graph data
  • SVD — LoRA is an SVD-style low-rank idea; randomized SVD in sklearn uses the same trick
  • Eigenvectors and eigenvalues — PCA as data preprocessing for a network; spectral features in GNNs

Связанные уроки

  • ml-25-neural-networks
  • ml-29-cnn
Linear Algebra in Deep Learning

0

1

Sign In