Scientific Computing

Numerical Optimization

SpaceX's trajectory optimization for Falcon 9 landing: minimize fuel consumed subject to reaching the landing pad without crashing. The constraint space has thousands of variables (thrust vectors over time), nonlinear dynamics, and actuator limits. A sequential quadratic programming solver runs in milliseconds on embedded hardware during descent. The same mathematical framework - Newton's method, quasi-Newton approximations, KKT conditions - appears in protein folding, circuit layout, portfolio management, and every ML hyperparameter search. Numerical optimization is the engine of quantitative science.

**Google's Vizier (Bayesian optimization)**: internal hyperparameter tuning service processing millions of optimization jobs per day; Bayesian optimization reduces tuning cost by 10-20x versus grid search for deep learning models with >10 hyperparameters
**MOSEK/Gurobi (convex optimization)**: commercial solvers used by hedge funds for portfolio optimization (quadratic programming), logistics companies for supply chain (linear programming), and chip designers for physical layout; Gurobi solves problems with 10^6 variables in seconds
**SciPy BFGS (L-BFGS-B)**: used in PyTorch's LBFGS optimizer, scikit-learn's logistic regression, and dozens of scientific computing packages; handles problems up to n=10^6 variables with 10-100 MB memory footprint

Newton's Method for Optimization

Gradient descent moves in the direction of steepest descent with a fixed or line-searched step size. **Newton's method** goes further: it uses the second derivative (Hessian matrix) to account for curvature, taking a step that goes directly to the minimum of the local quadratic approximation. On strongly convex functions, Newton's method converges quadratically - each step roughly squares the error. Gradient descent converges linearly (error decreases by a constant factor each step). For a 1000-variable optimization: Newton might need 20 iterations; gradient descent 10,000.

**Newton's optimization step:** `x_{k+1} = x_k - H(x_k)^{-1} · ∇f(x_k)` where H is the Hessian matrix (n×n matrix of second partial derivatives) and ∇f is the gradient. The step `H^{-1}·∇f` is the **Newton direction** - the exact minimizer of the second-order Taylor approximation. Cost per iteration: O(n^3) for the linear solve (dominant cost for large n).

**Newton's method fails when H is not positive definite**: near saddle points or maxima, the Hessian has negative eigenvalues and Newton's direction may go uphill. Fixes: Levenberg-Marquardt modification (`H + λI`), trust region methods, or switching to gradient descent when curvature is problematic. Deep learning avoids this entirely with SGD/Adam - the Hessian is never computed.

Newton's method has quadratic convergence. What does this mean in practice?

Quasi-Newton Methods: BFGS

Newton's method needs the Hessian evaluated at every step: O(n^2) storage, O(n^3) solve. For n=10,000 parameters, the Hessian is 100 million entries (800 MB) and the solve takes 10^12 operations per step. **Quasi-Newton methods** approximate the inverse Hessian from gradient information alone - no second derivatives required. BFGS (Broyden-Fletcher-Goldfarb-Shanno) maintains a rank-2 update of the inverse Hessian approximation, achieving superlinear convergence at O(n^2) cost.

**BFGS update formula:** Given gradient differences: `y_k = ∇f(x_{k+1}) - ∇f(x_k)` and step `s_k = x_{k+1} - x_k`: `H_{k+1}^{-1} = (I - ρ_k s_k y_k^T) H_k^{-1} (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T` where `ρ_k = 1/(y_k^T s_k)`. This satisfies the **secant condition**: `H_{k+1} s_k = y_k` - the Hessian approximation is consistent with the observed gradient change. **L-BFGS** (Limited-memory BFGS) stores only the last m (s_k, y_k) pairs instead of the full matrix, reducing memory from O(n^2) to O(mn) - making it tractable for n=10^6 variables. PyTorch's LBFGS optimizer implements this.

Why is L-BFGS preferred over BFGS for large-scale optimization (n > 10,000)?

Constrained Optimization

Most engineering optimization has constraints: minimize fuel consumption subject to safety margins; minimize portfolio risk subject to a return target; find the shortest path subject to avoiding obstacles. Constrained optimization adds equality constraints `g(x) = 0` and inequality constraints `h(x) ≤ 0`. The **KKT conditions** (Karush-Kuhn-Tucker) characterize optimal points; methods like Sequential Quadratic Programming (SQP) and interior point methods solve the KKT system numerically.

**Lagrangian for constrained optimization:** `L(x, λ, μ) = f(x) + λ^T g(x) + μ^T h(x)` **KKT conditions** (necessary for optimality): 1. Stationarity: ∇_x L = 0 2. Primal feasibility: g(x*) = 0, h(x*) ≤ 0 3. Dual feasibility: μ ≥ 0 4. Complementary slackness: μ_i · h_i(x*) = 0 for all i Active inequality constraints (h_i(x*) = 0) act like equality constraints at the optimum.

The complementary slackness KKT condition states μ_i · h_i(x*) = 0. What does this imply for inactive inequality constraints (h_i(x*) < 0)?

Global Optimization

Newton and BFGS find **local** minima - the nearest valley from the starting point. For non-convex problems with many local minima (protein folding, hyperparameter optimization, circuit layout), the global minimum may be far from any starting point. Global optimization is NP-hard in general but practical methods exist: **simulated annealing** (probabilistic hill-climbing), **genetic algorithms** (population-based search), **Bayesian optimization** (smart sequential search using Gaussian process surrogate), and **basin-hopping** (local optimization + random perturbation).

Method	Mechanism	Strengths	Weaknesses
Simulated Annealing	Probabilistic acceptance of worse solutions	Simple, general, theoretical guarantees	Slow cooling required; many evaluations
Genetic Algorithm	Population + crossover + mutation + selection	Parallel search; handles discrete variables	Many hyperparameters; no gradient use
Bayesian Optimization	GP surrogate + acquisition function	Very sample-efficient (10-100 evals)	Scales poorly past ~20 dimensions
Basin-Hopping	Local opt + perturbation, accept if global improves	Uses local optimizers; fast	Problem-specific perturbation design
DIRECT	Dividing rectangles, deterministic	No hyperparameters; guaranteed coverage	Scales poorly to high dimensions

Newton's method is always better than gradient descent because it converges faster

Newton's method is better for smooth, moderate-scale problems; gradient descent (SGD, Adam) is preferred for large-scale machine learning because computing and storing the Hessian is infeasible for millions of parameters

For a neural network with n=10 million parameters, the Hessian is a 10^14-element matrix - impossible to store or invert. Stochastic gradient descent's noise actually helps escape local minima and sharp basins in non-convex deep learning landscapes. Adam approximates diagonal curvature (per-parameter learning rate) at O(n) memory cost. Newton's quadratic convergence advantage only matters when you can afford the Hessian; for deep learning, you cannot.

Bayesian optimization is particularly efficient for hyperparameter tuning because:

Key Ideas

**Newton's method** uses the Hessian (second derivatives) to take exact quadratic steps; quadratic convergence (doubles correct digits each step); O(n^3) cost per iteration makes it impractical for n > 10,000
**BFGS/L-BFGS** approximate the inverse Hessian from gradient differences; L-BFGS stores only m recent (s,y) pairs, reducing memory from O(n^2) to O(mn); superlinear convergence with no Hessian evaluation
**Constrained optimization** (KKT conditions): equality constraints via Lagrange multipliers, inequality constraints via dual variables with μ_i ≥ 0; SQP and interior point methods solve KKT systems numerically
**Global optimization** methods (Bayesian optimization, simulated annealing, genetic algorithms) trade convergence guarantees for wider search; Bayesian optimization is sample-efficient (50 evals) for hyperparameter tuning vs grid search (10,000+)

Вопросы для размышления

Newton's method computes the full Hessian at each step. For a neural network with 175 billion parameters (GPT-3 scale), what is the memory requirement for the Hessian matrix in bytes? Why does this make Newton's method impractical, and what does Adam approximate instead?
Bayesian optimization with a Gaussian process surrogate scales as O(n^3) in the number of evaluations n. What does this imply about its practical limit - and how do sparse GP approximations address this?
The KKT complementary slackness condition is μ_i · h_i(x*) = 0. Explain geometrically why active inequality constraints (h_i = 0) can have non-zero multipliers while inactive constraints (h_i < 0) must have μ_i = 0.

Связанные уроки

nm-05

Scientific Computing

Numerical Optimization

**Google's Vizier (Bayesian optimization)**: internal hyperparameter tuning service processing millions of optimization jobs per day; Bayesian optimization reduces tuning cost by 10-20x versus grid search for deep learning models with >10 hyperparameters
**MOSEK/Gurobi (convex optimization)**: commercial solvers used by hedge funds for portfolio optimization (quadratic programming), logistics companies for supply chain (linear programming), and chip designers for physical layout; Gurobi solves problems with 10^6 variables in seconds
**SciPy BFGS (L-BFGS-B)**: used in PyTorch's LBFGS optimizer, scikit-learn's logistic regression, and dozens of scientific computing packages; handles problems up to n=10^6 variables with 10-100 MB memory footprint

Newton's Method for Optimization

Newton's method has quadratic convergence. What does this mean in practice?

Quasi-Newton Methods: BFGS

Why is L-BFGS preferred over BFGS for large-scale optimization (n > 10,000)?

Constrained Optimization

The complementary slackness KKT condition states μ_i · h_i(x*) = 0. What does this imply for inactive inequality constraints (h_i(x*) < 0)?

Global Optimization

Method	Mechanism	Strengths	Weaknesses
Simulated Annealing	Probabilistic acceptance of worse solutions	Simple, general, theoretical guarantees	Slow cooling required; many evaluations
Genetic Algorithm	Population + crossover + mutation + selection	Parallel search; handles discrete variables	Many hyperparameters; no gradient use
Bayesian Optimization	GP surrogate + acquisition function	Very sample-efficient (10-100 evals)	Scales poorly past ~20 dimensions
Basin-Hopping	Local opt + perturbation, accept if global improves	Uses local optimizers; fast	Problem-specific perturbation design
DIRECT	Dividing rectangles, deterministic	No hyperparameters; guaranteed coverage	Scales poorly to high dimensions

Newton's method is always better than gradient descent because it converges faster

Bayesian optimization is particularly efficient for hyperparameter tuning because:

Key Ideas

**Newton's method** uses the Hessian (second derivatives) to take exact quadratic steps; quadratic convergence (doubles correct digits each step); O(n^3) cost per iteration makes it impractical for n > 10,000
**BFGS/L-BFGS** approximate the inverse Hessian from gradient differences; L-BFGS stores only m recent (s,y) pairs, reducing memory from O(n^2) to O(mn); superlinear convergence with no Hessian evaluation
**Constrained optimization** (KKT conditions): equality constraints via Lagrange multipliers, inequality constraints via dual variables with μ_i ≥ 0; SQP and interior point methods solve KKT systems numerically
**Global optimization** methods (Bayesian optimization, simulated annealing, genetic algorithms) trade convergence guarantees for wider search; Bayesian optimization is sample-efficient (50 evals) for hyperparameter tuning vs grid search (10,000+)

Вопросы для размышления

Newton's method computes the full Hessian at each step. For a neural network with 175 billion parameters (GPT-3 scale), what is the memory requirement for the Hessian matrix in bytes? Why does this make Newton's method impractical, and what does Adam approximate instead?
Bayesian optimization with a Gaussian process surrogate scales as O(n^3) in the number of evaluations n. What does this imply about its practical limit - and how do sparse GP approximations address this?
The KKT complementary slackness condition is μ_i · h_i(x*) = 0. Explain geometrically why active inequality constraints (h_i = 0) can have non-zero multipliers while inactive constraints (h_i < 0) must have μ_i = 0.

Связанные уроки

nm-05

Numerical Optimization

Newton's Method for Optimization

Quasi-Newton Methods: BFGS

Constrained Optimization

Global Optimization

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Numerical Optimization

Newton's Method for Optimization

Quasi-Newton Methods: BFGS

Constrained Optimization

Global Optimization

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки