Optimization

Nonlinear Optimization

Training GPT takes millions of Adam iterations, while a physics simulation problem is solved by L-BFGS in dozens. The choice of optimizer is not a detail-it's an architectural decision. Understanding second-order methods explains why Adam and Newton are not interchangeable.

  • **scipy.optimize.minimize:** standard for scientific computing; L-BFGS-B solves problems with thousands of parameters in seconds
  • **PyTorch LBFGS:** used for physics-informed neural networks (PINN), style transfer, fine-tuning small models on small datasets
  • **Second-order at Google:** K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher matrix for transformer training-applied second-order optimization at scale

Предварительные знания

  • Interior Point Methods

Gradient Descent and Line Search

Gradient descent: xk+1 = xk - alpha_k * nabla f(xk). The critical choice is the step size alpha_k. Too small-slow convergence. Too large-divergence. **Line search** solves this: it finds alpha_k satisfying the Wolfe conditions (sufficient decrease + curvature condition).

**Wolfe conditions** are the standard for line search: 1. Armijo/sufficient decrease: f drops enough 2. curvature: the step isn't too small (gradient still has downward component). Together they guarantee convergence of gradient descent to a local optimum.

Step strategyProsCons
Fixed alphaSimpleNeeds tuning; may diverge
Backtracking (Armijo)Simple to implementOnly decrease condition
Wolfe conditionsConvergence guaranteeMore expensive; two conditions
Exact line searchOptimal stepVery expensive; rarely needed

Why is backtracking line search preferable to a fixed step size in gradient descent?

Newton's Method

Newton's method uses both the gradient and **curvature** (Hessian): xk+1 = xk - [nabla^2 f(xk)]^{-1} nabla f(xk). Geometrically: build a quadratic approximation and jump to its minimum. This gives **quadratic convergence**: ||xk+1 - x*|| <= C * ||xk - x*||^2.

**Newton's method bottleneck:** computing and storing the Hessian costs O(n^2) memory and O(n^3) time for inversion. For a neural network with 10^8 parameters this is physically impossible. Solution: quasi-Newton methods (BFGS, L-BFGS) that approximate the Hessian without computing it.

Newton's method has quadratic convergence but is rarely used for neural network training. Why?

BFGS and Quasi-Newton

**BFGS** (Broyden-Fletcher-Goldfarb-Shanno, 1970) approximates the inverse Hessian Hk using only gradients. It uses a rank-2 update: Hk+1 = Hk + (rank-2 correction from sk and yk), where sk = xk+1-xk, yk = nabla f(xk+1)-nabla f(xk).

**BFGS rank-2 update:** Hk+1 = (I - rho_k*sk*yk^T)*Hk*(I - rho_k*yk*sk^T) + rho_k*sk*sk^T, where rho_k = 1/(yk^T*sk). The update costs O(n^2) (vector-vector products), not O(n^3) for Hessian inversion. Hk remains positive definite when Wolfe conditions hold.

What is the fundamental difference between BFGS and Newton's method?

L-BFGS: Limited Memory

**L-BFGS** (Limited-memory BFGS) scales BFGS to large n. Instead of storing Hk (O(n^2) memory), it keeps only the last m pairs (sk, yk). The update uses a two-loop recursion costing O(mn) instead of O(n^2).

MethodConvergenceMemoryWhen to use
GDLinear O(1/k)O(n)Simple convex problems, warm-up
BFGSSuperlinearO(n^2)n < 10^4, nonlinear problems
L-BFGSSuperlinearO(mn)n up to 10^6, ML model fine-tuning
NewtonQuadraticO(n^2)n < 10^3, mathematical optimization
Adam/SGDLinear/sublinearO(n)n > 10^6, neural networks (non-convex)

**Fine-tuning LLMs with L-BFGS:** for small LLMs (BERT-base, GPT-2) L-BFGS with gradient checkpointing can outperform Adam on small fine-tuning datasets. The catch: L-BFGS needs full batches-incompatible with stochastic noise from mini-batches.

L-BFGS is better than Adam for neural networks because it uses curvature information

Adam dominates for neural networks in practice because it's compatible with mini-batch SGD. L-BFGS requires accurate full-batch gradients, which is impractical for large datasets.

L-BFGS approximates the Hessian using accumulated gradient history. Stochastic gradients (mini-batch SGD) are too noisy-the Hessian approximation degrades. Adam adapts step sizes per-coordinate using first and second moment estimates, which works well with noisy gradients. That's why Adam dominates deep learning while L-BFGS excels in physics simulation, scientific computing, and fine-tuning on small datasets.

Why does L-BFGS store only m pairs (s, y) instead of the full Hessian approximation?

Key Ideas

  • **Line search** adapts step size: backtracking (Armijo)-fast and simple; Wolfe conditions-convergence guarantee
  • **Newton's method** uses the Hessian for quadratic convergence, but O(n^2) memory and O(n^3) time make it infeasible for large n
  • **BFGS** approximates the inverse Hessian via rank-2 updates O(n^2); superlinear convergence
  • **L-BFGS** stores only m pairs (s, y); O(mn) memory; standard for nonlinear problems up to n~10^6

Related Topics

Second-order methods complement SGD for different regimes:

  • Stochastic Optimization — SGD and Adam are stochastic analogues of GD; L-BFGS is incompatible with mini-batch SGD
  • Constrained Optimization — L-BFGS-B adds box constraints; KKT conditions generalize to general constraints
  • Interior Point Methods — IPM uses Newton steps internally; shared math-inverting the barrier function Hessian

Вопросы для размышления

  • Why doesn't Newton's quadratic convergence make it the best choice for neural network training? Which practical constraints matter more than theoretical convergence speed?
  • L-BFGS stores only the last m Hessian updates. How does this affect approximation quality? What m is appropriate in practice?
  • Which method fits: (a) logistic regression on 10^6 examples, (b) physics simulation with 1000 parameters, (c) training GPT-2?

Связанные уроки

  • calc-01-sequences
Nonlinear Optimization

0

1

Sign In