Optimization
Nonlinear Optimization
Training GPT takes millions of Adam iterations, while a physics simulation problem is solved by L-BFGS in dozens. The choice of optimizer is not a detail-it's an architectural decision. Understanding second-order methods explains why Adam and Newton are not interchangeable.
- **scipy.optimize.minimize:** standard for scientific computing; L-BFGS-B solves problems with thousands of parameters in seconds
- **PyTorch LBFGS:** used for physics-informed neural networks (PINN), style transfer, fine-tuning small models on small datasets
- **Second-order at Google:** K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher matrix for transformer training-applied second-order optimization at scale
Предварительные знания
Gradient Descent and Line Search
Gradient descent: xk+1 = xk - alpha_k * nabla f(xk). The critical choice is the step size alpha_k. Too small-slow convergence. Too large-divergence. **Line search** solves this: it finds alpha_k satisfying the Wolfe conditions (sufficient decrease + curvature condition).
**Wolfe conditions** are the standard for line search: 1. Armijo/sufficient decrease: f drops enough 2. curvature: the step isn't too small (gradient still has downward component). Together they guarantee convergence of gradient descent to a local optimum.
| Step strategy | Pros | Cons |
|---|---|---|
| Fixed alpha | Simple | Needs tuning; may diverge |
| Backtracking (Armijo) | Simple to implement | Only decrease condition |
| Wolfe conditions | Convergence guarantee | More expensive; two conditions |
| Exact line search | Optimal step | Very expensive; rarely needed |
Why is backtracking line search preferable to a fixed step size in gradient descent?
Newton's Method
Newton's method uses both the gradient and **curvature** (Hessian): xk+1 = xk - [nabla^2 f(xk)]^{-1} nabla f(xk). Geometrically: build a quadratic approximation and jump to its minimum. This gives **quadratic convergence**: ||xk+1 - x*|| <= C * ||xk - x*||^2.
**Newton's method bottleneck:** computing and storing the Hessian costs O(n^2) memory and O(n^3) time for inversion. For a neural network with 10^8 parameters this is physically impossible. Solution: quasi-Newton methods (BFGS, L-BFGS) that approximate the Hessian without computing it.
Newton's method has quadratic convergence but is rarely used for neural network training. Why?
BFGS and Quasi-Newton
**BFGS** (Broyden-Fletcher-Goldfarb-Shanno, 1970) approximates the inverse Hessian Hk using only gradients. It uses a rank-2 update: Hk+1 = Hk + (rank-2 correction from sk and yk), where sk = xk+1-xk, yk = nabla f(xk+1)-nabla f(xk).
**BFGS rank-2 update:** Hk+1 = (I - rho_k*sk*yk^T)*Hk*(I - rho_k*yk*sk^T) + rho_k*sk*sk^T, where rho_k = 1/(yk^T*sk). The update costs O(n^2) (vector-vector products), not O(n^3) for Hessian inversion. Hk remains positive definite when Wolfe conditions hold.
What is the fundamental difference between BFGS and Newton's method?
L-BFGS: Limited Memory
**L-BFGS** (Limited-memory BFGS) scales BFGS to large n. Instead of storing Hk (O(n^2) memory), it keeps only the last m pairs (sk, yk). The update uses a two-loop recursion costing O(mn) instead of O(n^2).
| Method | Convergence | Memory | When to use |
|---|---|---|---|
| GD | Linear O(1/k) | O(n) | Simple convex problems, warm-up |
| BFGS | Superlinear | O(n^2) | n < 10^4, nonlinear problems |
| L-BFGS | Superlinear | O(mn) | n up to 10^6, ML model fine-tuning |
| Newton | Quadratic | O(n^2) | n < 10^3, mathematical optimization |
| Adam/SGD | Linear/sublinear | O(n) | n > 10^6, neural networks (non-convex) |
**Fine-tuning LLMs with L-BFGS:** for small LLMs (BERT-base, GPT-2) L-BFGS with gradient checkpointing can outperform Adam on small fine-tuning datasets. The catch: L-BFGS needs full batches-incompatible with stochastic noise from mini-batches.
L-BFGS is better than Adam for neural networks because it uses curvature information
Adam dominates for neural networks in practice because it's compatible with mini-batch SGD. L-BFGS requires accurate full-batch gradients, which is impractical for large datasets.
L-BFGS approximates the Hessian using accumulated gradient history. Stochastic gradients (mini-batch SGD) are too noisy-the Hessian approximation degrades. Adam adapts step sizes per-coordinate using first and second moment estimates, which works well with noisy gradients. That's why Adam dominates deep learning while L-BFGS excels in physics simulation, scientific computing, and fine-tuning on small datasets.
Why does L-BFGS store only m pairs (s, y) instead of the full Hessian approximation?
Key Ideas
- **Line search** adapts step size: backtracking (Armijo)-fast and simple; Wolfe conditions-convergence guarantee
- **Newton's method** uses the Hessian for quadratic convergence, but O(n^2) memory and O(n^3) time make it infeasible for large n
- **BFGS** approximates the inverse Hessian via rank-2 updates O(n^2); superlinear convergence
- **L-BFGS** stores only m pairs (s, y); O(mn) memory; standard for nonlinear problems up to n~10^6
Related Topics
Second-order methods complement SGD for different regimes:
- Stochastic Optimization — SGD and Adam are stochastic analogues of GD; L-BFGS is incompatible with mini-batch SGD
- Constrained Optimization — L-BFGS-B adds box constraints; KKT conditions generalize to general constraints
- Interior Point Methods — IPM uses Newton steps internally; shared math-inverting the barrier function Hessian
Вопросы для размышления
- Why doesn't Newton's quadratic convergence make it the best choice for neural network training? Which practical constraints matter more than theoretical convergence speed?
- L-BFGS stores only the last m Hessian updates. How does this affect approximation quality? What m is appropriate in practice?
- Which method fits: (a) logistic regression on 10^6 examples, (b) physics simulation with 1000 parameters, (c) training GPT-2?