Optimization

Nonlinear Optimization

Training GPT takes millions of Adam iterations, while a physics simulation problem is solved by L-BFGS in dozens. The choice of optimizer is not a detail-it's an architectural decision. Understanding second-order methods explains why Adam and Newton are not interchangeable.

**scipy.optimize.minimize:** standard for scientific computing; L-BFGS-B solves problems with thousands of parameters in seconds
**PyTorch LBFGS:** used for physics-informed neural networks (PINN), style transfer, fine-tuning small models on small datasets
**Second-order at Google:** K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher matrix for transformer training-applied second-order optimization at scale

Предварительные знания

Interior Point Methods

Gradient Descent and Line Search

Gradient descent: xk+1 = xk - alpha_k * nabla f(xk). The critical choice is the step size alpha_k. Too small-slow convergence. Too large-divergence. **Line search** solves this: it finds alpha_k satisfying the Wolfe conditions (sufficient decrease + curvature condition).

**Wolfe conditions** are the standard for line search: 1. Armijo/sufficient decrease: f drops enough 2. curvature: the step isn't too small (gradient still has downward component). Together they guarantee convergence of gradient descent to a local optimum.

Step strategy	Pros	Cons
Fixed alpha	Simple	Needs tuning; may diverge
Backtracking (Armijo)	Simple to implement	Only decrease condition
Wolfe conditions	Convergence guarantee	More expensive; two conditions
Exact line search	Optimal step	Very expensive; rarely needed

Why is backtracking line search preferable to a fixed step size in gradient descent?

Newton's Method

Newton's method uses both the gradient and **curvature** (Hessian): xk+1 = xk - [nabla^2 f(xk)]^{-1} nabla f(xk). Geometrically: build a quadratic approximation and jump to its minimum. This gives **quadratic convergence**: ||xk+1 - x*|| <= C * ||xk - x*||^2.

**Newton's method bottleneck:** computing and storing the Hessian costs O(n^2) memory and O(n^3) time for inversion. For a neural network with 10^8 parameters this is physically impossible. Solution: quasi-Newton methods (BFGS, L-BFGS) that approximate the Hessian without computing it.

Newton's method has quadratic convergence but is rarely used for neural network training. Why?

BFGS and Quasi-Newton

**BFGS** (Broyden-Fletcher-Goldfarb-Shanno, 1970) approximates the inverse Hessian Hk using only gradients. It uses a rank-2 update: Hk+1 = Hk + (rank-2 correction from sk and yk), where sk = xk+1-xk, yk = nabla f(xk+1)-nabla f(xk).

**BFGS rank-2 update:** Hk+1 = (I - rho_k*sk*yk^T)*Hk*(I - rho_k*yk*sk^T) + rho_k*sk*sk^T, where rho_k = 1/(yk^T*sk). The update costs O(n^2) (vector-vector products), not O(n^3) for Hessian inversion. Hk remains positive definite when Wolfe conditions hold.

What is the fundamental difference between BFGS and Newton's method?

L-BFGS: Limited Memory

**L-BFGS** (Limited-memory BFGS) scales BFGS to large n. Instead of storing Hk (O(n^2) memory), it keeps only the last m pairs (sk, yk). The update uses a two-loop recursion costing O(mn) instead of O(n^2).

Method	Convergence	Memory	When to use
GD	Linear O(1/k)	O(n)	Simple convex problems, warm-up
BFGS	Superlinear	O(n^2)	n < 10^4, nonlinear problems
L-BFGS	Superlinear	O(mn)	n up to 10^6, ML model fine-tuning
Newton	Quadratic	O(n^2)	n < 10^3, mathematical optimization
Adam/SGD	Linear/sublinear	O(n)	n > 10^6, neural networks (non-convex)

**Fine-tuning LLMs with L-BFGS:** for small LLMs (BERT-base, GPT-2) L-BFGS with gradient checkpointing can outperform Adam on small fine-tuning datasets. The catch: L-BFGS needs full batches-incompatible with stochastic noise from mini-batches.

L-BFGS is better than Adam for neural networks because it uses curvature information

Adam dominates for neural networks in practice because it's compatible with mini-batch SGD. L-BFGS requires accurate full-batch gradients, which is impractical for large datasets.

L-BFGS approximates the Hessian using accumulated gradient history. Stochastic gradients (mini-batch SGD) are too noisy-the Hessian approximation degrades. Adam adapts step sizes per-coordinate using first and second moment estimates, which works well with noisy gradients. That's why Adam dominates deep learning while L-BFGS excels in physics simulation, scientific computing, and fine-tuning on small datasets.

Why does L-BFGS store only m pairs (s, y) instead of the full Hessian approximation?

Key Ideas

**Line search** adapts step size: backtracking (Armijo)-fast and simple; Wolfe conditions-convergence guarantee
**Newton's method** uses the Hessian for quadratic convergence, but O(n^2) memory and O(n^3) time make it infeasible for large n
**BFGS** approximates the inverse Hessian via rank-2 updates O(n^2); superlinear convergence
**L-BFGS** stores only m pairs (s, y); O(mn) memory; standard for nonlinear problems up to n~10^6

Вопросы для размышления

Why doesn't Newton's quadratic convergence make it the best choice for neural network training? Which practical constraints matter more than theoretical convergence speed?
L-BFGS stores only the last m Hessian updates. How does this affect approximation quality? What m is appropriate in practice?
Which method fits: (a) logistic regression on 10^6 examples, (b) physics simulation with 1000 parameters, (c) training GPT-2?

Связанные уроки

calc-01-sequences

Optimization

Nonlinear Optimization

**scipy.optimize.minimize:** standard for scientific computing; L-BFGS-B solves problems with thousands of parameters in seconds
**PyTorch LBFGS:** used for physics-informed neural networks (PINN), style transfer, fine-tuning small models on small datasets
**Second-order at Google:** K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher matrix for transformer training-applied second-order optimization at scale

Предварительные знания

Interior Point Methods

Gradient Descent and Line Search

Step strategy	Pros	Cons
Fixed alpha	Simple	Needs tuning; may diverge
Backtracking (Armijo)	Simple to implement	Only decrease condition
Wolfe conditions	Convergence guarantee	More expensive; two conditions
Exact line search	Optimal step	Very expensive; rarely needed

Why is backtracking line search preferable to a fixed step size in gradient descent?

Newton's Method

Newton's method has quadratic convergence but is rarely used for neural network training. Why?

BFGS and Quasi-Newton

What is the fundamental difference between BFGS and Newton's method?

L-BFGS: Limited Memory

Method	Convergence	Memory	When to use
GD	Linear O(1/k)	O(n)	Simple convex problems, warm-up
BFGS	Superlinear	O(n^2)	n < 10^4, nonlinear problems
L-BFGS	Superlinear	O(mn)	n up to 10^6, ML model fine-tuning
Newton	Quadratic	O(n^2)	n < 10^3, mathematical optimization
Adam/SGD	Linear/sublinear	O(n)	n > 10^6, neural networks (non-convex)

L-BFGS is better than Adam for neural networks because it uses curvature information

Adam dominates for neural networks in practice because it's compatible with mini-batch SGD. L-BFGS requires accurate full-batch gradients, which is impractical for large datasets.

Why does L-BFGS store only m pairs (s, y) instead of the full Hessian approximation?

Key Ideas

**Line search** adapts step size: backtracking (Armijo)-fast and simple; Wolfe conditions-convergence guarantee
**Newton's method** uses the Hessian for quadratic convergence, but O(n^2) memory and O(n^3) time make it infeasible for large n
**BFGS** approximates the inverse Hessian via rank-2 updates O(n^2); superlinear convergence
**L-BFGS** stores only m pairs (s, y); O(mn) memory; standard for nonlinear problems up to n~10^6

Вопросы для размышления

Why doesn't Newton's quadratic convergence make it the best choice for neural network training? Which practical constraints matter more than theoretical convergence speed?
L-BFGS stores only the last m Hessian updates. How does this affect approximation quality? What m is appropriate in practice?
Which method fits: (a) logistic regression on 10^6 examples, (b) physics simulation with 1000 parameters, (c) training GPT-2?

Связанные уроки

calc-01-sequences

Nonlinear Optimization

Предварительные знания

Gradient Descent and Line Search

Newton's Method

BFGS and Quasi-Newton

L-BFGS: Limited Memory

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Nonlinear Optimization

Предварительные знания

Gradient Descent and Line Search

Newton's Method

BFGS and Quasi-Newton

L-BFGS: Limited Memory

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки