Numerical Methods

Numerical Methods in Machine Learning

GPT-4 trained on thousands of GPUs for months. Yet every operation in that process-finite differences, matrix factorizations, fp16 stability-is applied numerical mathematics. Without this foundation one cannot debug NaN mid-training or deploy quantized models without quality collapse.

  • **Mixed-precision training (bf16):** BERT, GPT, LLaMA all train in bf16 on A100/H100 - 2× speedup with fp32 stability, because bf16 shares the fp32 exponent range
  • **LoRA (Low-Rank Adaptation):** fine-tuning GPT-3 via randomized SVD - 10 000× fewer trainable parameters; the math is straight from Halko-Martinsson-Tropp
  • **K-FAC at DeepMind:** second-order optimizer for AlphaFold 2 used Kronecker-factored Fisher matrix - convergence in 10× fewer steps than Adam

Предварительные знания

  • Finite Element Method

Numerical Stability in Deep Learning

Deep learning is applied numerical mathematics. Most training bugs are fp16 overflow, gradient explosion, and vanishing gradients. Understanding floating-point formats lets one prevent these errors at the architecture level rather than chasing them in production.

**Floating-point formats:**

FormatSignExponentMantissaRange
fp3218 bits23 bits±3.4 × 10³⁸
fp1615 bits10 bits±65504 (small!)
bf1618 bits7 bits±3.4 × 10³⁸

**Gradient clipping:** if ||g||₂ > max_norm, then g ← g · max_norm / ||g||₂ **Loss scaling (fp16):** multiply loss by S = 2¹⁵ before backward pass, divide gradients by S before optimizer.step(). Guards against denormalized fp16 values.

**Softmax numerical instability:** exp(1000) = inf in fp32. The standard fix-subtract max(x) before softmax: softmax(x) = softmax(x − max(x)). PyTorch handles this automatically. Custom CUDA kernels are where this is routinely missed.

Why is bf16 preferred over fp16 for training transformers?

Second-Order Methods and Krylov Subspaces

SGD and Adam are first-order methods-they use only the gradient. L-BFGS and K-FAC are second-order-they account for curvature (Hessian). For small batches or tasks where convergence quality outweighs per-step cost, they are far more efficient. L-BFGS stores the last m (s, y) pairs instead of an N² Hessian matrix.

**L-BFGS: Hessian approximation via (s, y) pairs:** sₖ = xₖ₊₁ − xₖ (step in parameter space) yₖ = ∇f(xₖ₊₁) − ∇f(xₖ) (gradient change) Hₖ⁻¹ is approximated recursively from the last m = 10 - 20 pairs. Per-step cost: O(m·n) rather than O(n²) for full BFGS. **K-FAC (Kronecker-Factored Approximate Curvature):** Fisher factor F ≈ A ⊗ G (Kronecker product) A-input covariance, G-output gradient covariance Inversion: (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹ 100× cheaper than full Fisher matrix! **When to choose L-BFGS over Adam:** - Full-batch optimization (entire dataset fits in memory) - Fine-tuning on small datasets - Physics-informed tasks where loss encodes PDEs

**Krylov connection:** the Krylov subspace Kₖ(A, b) = span{b, Ab, A²b, ..., Aᵏ⁻¹b} underlies CG and GMRES for linear systems. L-BFGS implicitly builds a Hessian approximation in a related Krylov-like space through the (s, y) pairs-that is why these methods share the same theoretical root.

Why is L-BFGS not used for training large language models?

Randomized SVD and the Halko Algorithm

Exact SVD of A ∈ ℝᵐˣⁿ costs O(mn·min(m,n)). For transformer weight matrices (12288 × 12288) that is billions of operations. The Halko-Martinsson-Tropp algorithm computes an approximate rank-k SVD in O(mn·log k)-100× faster when k ≪ min(m, n).

**Randomized SVD algorithm (Halko-Martinsson-Tropp, 2011):** 1. Draw random matrix Ω ∈ ℝⁿˣ(k+p), p = 10 (oversampling) 2. Compute sketch Y = A · Ω 3. QR decomposition: Y = QR (Q is an orthonormal basis for Range(A)) 4. Project: B = Qᵀ · A (small (k+p) × n matrix) 5. SVD of small matrix: B = ÛσVᵀ 6. Recover: U = Q · Û **Result:** Aₖ ≈ U[:,:k] · diag(σ[:k]) · V[:,:k]ᵀ Error bound: ||A − Aₖ||₂ ≤ C · σₖ₊₁(A) **ML applications:** - LoRA (Low-Rank Adaptation): W = W₀ + AB, rank(A,B) ≪ rank(W₀) - Embedding compression - PCA on millions of vectors

Why does randomized SVD use power iterations (n_iter=4)?

Automatic Differentiation vs Numerical Differentiation

PyTorch and JAX use automatic differentiation-exact derivatives via the chain rule applied to a computation graph. Numerical differentiation (finite differences) is useful for verification and black-box problems. The distinction affects accuracy, cost, and domain of applicability.

**Three modes of differentiation:** **Symbolic:** f(x) = x² → f'(x) = 2x. Exact, but expression size explodes. **Numerical (FD):** f'(x) ≈ (f(x+h) − f(x−h)) / (2h) - Error: O(h²) (central differences) - Pitfall: cancellation at small h (machine epsilon ≈ 1e-15) - Cost: O(n) function evaluations for n parameters **Automatic differentiation (AD):** - Forward mode: one forward pass → derivative with respect to one input - Reverse mode (backprop): one backward pass → ∂L/∂θ for ALL n parameters - Accuracy: machine precision (not an approximation!) - Backward pass cost: ≈ 3 - 5× forward pass **Rule of thumb:** if n parameters >> 1, always use reverse mode AD.

MethodAccuracyCost (n parameters)Use case
SymbolicExactExpression explosionAnalytics, CAS
Numerical FDO(h²)O(n) × cost(f)Verification, black box
AD forward modeMachine epsilonO(n) × cost(f)Jacobian, small n
AD reverse modeMachine epsilon≈ 5× cost(f)Neural nets, n >> 1

Why do neural networks use reverse mode AD rather than forward mode?

Key Ideas

  • **fp16 vs bf16:** bf16 has fp32 exponent range (8-bit) - no overflow, no GradScaler; fp16 max is 65504; GradScaler scales loss by 2¹⁵ then unscales gradients
  • **L-BFGS:** Hessian approximated via m (s,y) pairs; O(m·n) memory; requires a closure; breaks with mini-batch SGD due to noisy line search
  • **Randomized SVD:** rank-k approximation in O(mn·log k); power iterations amplify dominant singular values; foundation of LoRA
  • **Reverse mode AD:** one backward pass → ∂L/∂θ for all n parameters; machine-precision accuracy; gradcheck uses finite differences to verify it

Related Topics

Numerical methods in ML draw on the entire course:

  • Finite Element Method — Weak formulation and matrix assembly are the numerical core of physics-informed neural networks (PINNs)
  • Eigenvalue Problems — Randomized SVD reuses the same Krylov subspace ideas as the Lanczos algorithm for eigenvalue computation

Вопросы для размышления

  • Why does gradient checkpointing allow training models that exceed VRAM - and what numerical trade-off does it introduce?
  • How are Adagrad, Adam, and K-FAC related through the concept of a preconditioner for the gradient?
  • In JAX, grad(grad(f)) computes a second derivative at O(n) cost rather than O(n²) - how does this connect to dual numbers and forward-over-reverse AD?

Связанные уроки

  • la-01-vectors-intro
Numerical Methods in Machine Learning

0

1

Sign In