Numerical Methods

Numerical Methods in Machine Learning

GPT-4 trained on thousands of GPUs for months. Yet every operation in that process-finite differences, matrix factorizations, fp16 stability-is applied numerical mathematics. Without this foundation one cannot debug NaN mid-training or deploy quantized models without quality collapse.

**Mixed-precision training (bf16):** BERT, GPT, LLaMA all train in bf16 on A100/H100 - 2× speedup with fp32 stability, because bf16 shares the fp32 exponent range
**LoRA (Low-Rank Adaptation):** fine-tuning GPT-3 via randomized SVD - 10 000× fewer trainable parameters; the math is straight from Halko-Martinsson-Tropp
**K-FAC at DeepMind:** second-order optimizer for AlphaFold 2 used Kronecker-factored Fisher matrix - convergence in 10× fewer steps than Adam

Предварительные знания

Finite Element Method

Numerical Stability in Deep Learning

Deep learning is applied numerical mathematics. Most training bugs are fp16 overflow, gradient explosion, and vanishing gradients. Understanding floating-point formats lets one prevent these errors at the architecture level rather than chasing them in production.

**Floating-point formats:**

Format	Sign	Exponent	Mantissa	Range
fp32	1	8 bits	23 bits	±3.4 × 10³⁸
fp16	1	5 bits	10 bits	±65504 (small!)
bf16	1	8 bits	7 bits	±3.4 × 10³⁸

**Gradient clipping:** if ||g||₂ > max_norm, then g ← g · max_norm / ||g||₂ **Loss scaling (fp16):** multiply loss by S = 2¹⁵ before backward pass, divide gradients by S before optimizer.step(). Guards against denormalized fp16 values.

**Softmax numerical instability:** exp(1000) = inf in fp32. The standard fix-subtract max(x) before softmax: softmax(x) = softmax(x − max(x)). PyTorch handles this automatically. Custom CUDA kernels are where this is routinely missed.

Why is bf16 preferred over fp16 for training transformers?

Second-Order Methods and Krylov Subspaces

SGD and Adam are first-order methods-they use only the gradient. L-BFGS and K-FAC are second-order-they account for curvature (Hessian). For small batches or tasks where convergence quality outweighs per-step cost, they are far more efficient. L-BFGS stores the last m (s, y) pairs instead of an N² Hessian matrix.

**L-BFGS: Hessian approximation via (s, y) pairs:** sₖ = xₖ₊₁ − xₖ (step in parameter space) yₖ = ∇f(xₖ₊₁) − ∇f(xₖ) (gradient change) Hₖ⁻¹ is approximated recursively from the last m = 10 - 20 pairs. Per-step cost: O(m·n) rather than O(n²) for full BFGS. **K-FAC (Kronecker-Factored Approximate Curvature):** Fisher factor F ≈ A ⊗ G (Kronecker product) A-input covariance, G-output gradient covariance Inversion: (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹ 100× cheaper than full Fisher matrix! **When to choose L-BFGS over Adam:** - Full-batch optimization (entire dataset fits in memory) - Fine-tuning on small datasets - Physics-informed tasks where loss encodes PDEs

**Krylov connection:** the Krylov subspace Kₖ(A, b) = span{b, Ab, A²b, ..., Aᵏ⁻¹b} underlies CG and GMRES for linear systems. L-BFGS implicitly builds a Hessian approximation in a related Krylov-like space through the (s, y) pairs-that is why these methods share the same theoretical root.

Why is L-BFGS not used for training large language models?

Randomized SVD and the Halko Algorithm

Exact SVD of A ∈ ℝᵐˣⁿ costs O(mn·min(m,n)). For transformer weight matrices (12288 × 12288) that is billions of operations. The Halko-Martinsson-Tropp algorithm computes an approximate rank-k SVD in O(mn·log k)-100× faster when k ≪ min(m, n).

**Randomized SVD algorithm (Halko-Martinsson-Tropp, 2011):** 1. Draw random matrix Ω ∈ ℝⁿˣ(k+p), p = 10 (oversampling) 2. Compute sketch Y = A · Ω 3. QR decomposition: Y = QR (Q is an orthonormal basis for Range(A)) 4. Project: B = Qᵀ · A (small (k+p) × n matrix) 5. SVD of small matrix: B = ÛσVᵀ 6. Recover: U = Q · Û **Result:** Aₖ ≈ U[:,:k] · diag(σ[:k]) · V[:,:k]ᵀ Error bound: ||A − Aₖ||₂ ≤ C · σₖ₊₁(A) **ML applications:** - LoRA (Low-Rank Adaptation): W = W₀ + AB, rank(A,B) ≪ rank(W₀) - Embedding compression - PCA on millions of vectors

Why does randomized SVD use power iterations (n_iter=4)?

Automatic Differentiation vs Numerical Differentiation

PyTorch and JAX use automatic differentiation-exact derivatives via the chain rule applied to a computation graph. Numerical differentiation (finite differences) is useful for verification and black-box problems. The distinction affects accuracy, cost, and domain of applicability.

**Three modes of differentiation:** **Symbolic:** f(x) = x² → f'(x) = 2x. Exact, but expression size explodes. **Numerical (FD):** f'(x) ≈ (f(x+h) − f(x−h)) / (2h) - Error: O(h²) (central differences) - Pitfall: cancellation at small h (machine epsilon ≈ 1e-15) - Cost: O(n) function evaluations for n parameters **Automatic differentiation (AD):** - Forward mode: one forward pass → derivative with respect to one input - Reverse mode (backprop): one backward pass → ∂L/∂θ for ALL n parameters - Accuracy: machine precision (not an approximation!) - Backward pass cost: ≈ 3 - 5× forward pass **Rule of thumb:** if n parameters >> 1, always use reverse mode AD.

Method	Accuracy	Cost (n parameters)	Use case
Symbolic	Exact	Expression explosion	Analytics, CAS
Numerical FD	O(h²)	O(n) × cost(f)	Verification, black box
AD forward mode	Machine epsilon	O(n) × cost(f)	Jacobian, small n
AD reverse mode	Machine epsilon	≈ 5× cost(f)	Neural nets, n >> 1

Why do neural networks use reverse mode AD rather than forward mode?

Key Ideas

**fp16 vs bf16:** bf16 has fp32 exponent range (8-bit) - no overflow, no GradScaler; fp16 max is 65504; GradScaler scales loss by 2¹⁵ then unscales gradients
**L-BFGS:** Hessian approximated via m (s,y) pairs; O(m·n) memory; requires a closure; breaks with mini-batch SGD due to noisy line search
**Randomized SVD:** rank-k approximation in O(mn·log k); power iterations amplify dominant singular values; foundation of LoRA
**Reverse mode AD:** one backward pass → ∂L/∂θ for all n parameters; machine-precision accuracy; gradcheck uses finite differences to verify it

Вопросы для размышления

Why does gradient checkpointing allow training models that exceed VRAM - and what numerical trade-off does it introduce?
How are Adagrad, Adam, and K-FAC related through the concept of a preconditioner for the gradient?
In JAX, grad(grad(f)) computes a second derivative at O(n) cost rather than O(n²) - how does this connect to dual numbers and forward-over-reverse AD?

Связанные уроки

la-01-vectors-intro

Numerical Methods

Numerical Methods in Machine Learning

**Mixed-precision training (bf16):** BERT, GPT, LLaMA all train in bf16 on A100/H100 - 2× speedup with fp32 stability, because bf16 shares the fp32 exponent range
**LoRA (Low-Rank Adaptation):** fine-tuning GPT-3 via randomized SVD - 10 000× fewer trainable parameters; the math is straight from Halko-Martinsson-Tropp
**K-FAC at DeepMind:** second-order optimizer for AlphaFold 2 used Kronecker-factored Fisher matrix - convergence in 10× fewer steps than Adam

Предварительные знания

Finite Element Method

Numerical Stability in Deep Learning

**Floating-point formats:**

Format	Sign	Exponent	Mantissa	Range
fp32	1	8 bits	23 bits	±3.4 × 10³⁸
fp16	1	5 bits	10 bits	±65504 (small!)
bf16	1	8 bits	7 bits	±3.4 × 10³⁸

Why is bf16 preferred over fp16 for training transformers?

Second-Order Methods and Krylov Subspaces

Why is L-BFGS not used for training large language models?

Randomized SVD and the Halko Algorithm

Why does randomized SVD use power iterations (n_iter=4)?

Automatic Differentiation vs Numerical Differentiation

Method	Accuracy	Cost (n parameters)	Use case
Symbolic	Exact	Expression explosion	Analytics, CAS
Numerical FD	O(h²)	O(n) × cost(f)	Verification, black box
AD forward mode	Machine epsilon	O(n) × cost(f)	Jacobian, small n
AD reverse mode	Machine epsilon	≈ 5× cost(f)	Neural nets, n >> 1

Why do neural networks use reverse mode AD rather than forward mode?

Key Ideas

**fp16 vs bf16:** bf16 has fp32 exponent range (8-bit) - no overflow, no GradScaler; fp16 max is 65504; GradScaler scales loss by 2¹⁵ then unscales gradients
**L-BFGS:** Hessian approximated via m (s,y) pairs; O(m·n) memory; requires a closure; breaks with mini-batch SGD due to noisy line search
**Randomized SVD:** rank-k approximation in O(mn·log k); power iterations amplify dominant singular values; foundation of LoRA
**Reverse mode AD:** one backward pass → ∂L/∂θ for all n parameters; machine-precision accuracy; gradcheck uses finite differences to verify it

Вопросы для размышления

Why does gradient checkpointing allow training models that exceed VRAM - and what numerical trade-off does it introduce?
How are Adagrad, Adam, and K-FAC related through the concept of a preconditioner for the gradient?
In JAX, grad(grad(f)) computes a second derivative at O(n) cost rather than O(n²) - how does this connect to dual numbers and forward-over-reverse AD?

Связанные уроки

la-01-vectors-intro

Numerical Methods in Machine Learning

Предварительные знания

Numerical Stability in Deep Learning

Second-Order Methods and Krylov Subspaces

Randomized SVD and the Halko Algorithm

Automatic Differentiation vs Numerical Differentiation

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Numerical Methods in Machine Learning

Предварительные знания

Numerical Stability in Deep Learning

Second-Order Methods and Krylov Subspaces

Randomized SVD and the Halko Algorithm

Automatic Differentiation vs Numerical Differentiation

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки