Scientific Computing

Numerical Optimization

SpaceX's trajectory optimization for Falcon 9 landing: minimize fuel consumed subject to reaching the landing pad without crashing. The constraint space has thousands of variables (thrust vectors over time), nonlinear dynamics, and actuator limits. A sequential quadratic programming solver runs in milliseconds on embedded hardware during descent. The same mathematical framework - Newton's method, quasi-Newton approximations, KKT conditions - appears in protein folding, circuit layout, portfolio management, and every ML hyperparameter search. Numerical optimization is the engine of quantitative science.

  • **Google's Vizier (Bayesian optimization)**: internal hyperparameter tuning service processing millions of optimization jobs per day; Bayesian optimization reduces tuning cost by 10-20x versus grid search for deep learning models with >10 hyperparameters
  • **MOSEK/Gurobi (convex optimization)**: commercial solvers used by hedge funds for portfolio optimization (quadratic programming), logistics companies for supply chain (linear programming), and chip designers for physical layout; Gurobi solves problems with 10^6 variables in seconds
  • **SciPy BFGS (L-BFGS-B)**: used in PyTorch's LBFGS optimizer, scikit-learn's logistic regression, and dozens of scientific computing packages; handles problems up to n=10^6 variables with 10-100 MB memory footprint

Newton's Method for Optimization

Gradient descent moves in the direction of steepest descent with a fixed or line-searched step size. **Newton's method** goes further: it uses the second derivative (Hessian matrix) to account for curvature, taking a step that goes directly to the minimum of the local quadratic approximation. On strongly convex functions, Newton's method converges quadratically - each step roughly squares the error. Gradient descent converges linearly (error decreases by a constant factor each step). For a 1000-variable optimization: Newton might need 20 iterations; gradient descent 10,000.

**Newton's optimization step:** `x_{k+1} = x_k - H(x_k)^{-1} · ∇f(x_k)` where H is the Hessian matrix (n×n matrix of second partial derivatives) and ∇f is the gradient. The step `H^{-1}·∇f` is the **Newton direction** - the exact minimizer of the second-order Taylor approximation. Cost per iteration: O(n^3) for the linear solve (dominant cost for large n).

**Newton's method fails when H is not positive definite**: near saddle points or maxima, the Hessian has negative eigenvalues and Newton's direction may go uphill. Fixes: Levenberg-Marquardt modification (`H + λI`), trust region methods, or switching to gradient descent when curvature is problematic. Deep learning avoids this entirely with SGD/Adam - the Hessian is never computed.

Newton's method has quadratic convergence. What does this mean in practice?

Quasi-Newton Methods: BFGS

Newton's method needs the Hessian evaluated at every step: O(n^2) storage, O(n^3) solve. For n=10,000 parameters, the Hessian is 100 million entries (800 MB) and the solve takes 10^12 operations per step. **Quasi-Newton methods** approximate the inverse Hessian from gradient information alone - no second derivatives required. BFGS (Broyden-Fletcher-Goldfarb-Shanno) maintains a rank-2 update of the inverse Hessian approximation, achieving superlinear convergence at O(n^2) cost.

**BFGS update formula:** Given gradient differences: `y_k = ∇f(x_{k+1}) - ∇f(x_k)` and step `s_k = x_{k+1} - x_k`: `H_{k+1}^{-1} = (I - ρ_k s_k y_k^T) H_k^{-1} (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T` where `ρ_k = 1/(y_k^T s_k)`. This satisfies the **secant condition**: `H_{k+1} s_k = y_k` - the Hessian approximation is consistent with the observed gradient change. **L-BFGS** (Limited-memory BFGS) stores only the last m (s_k, y_k) pairs instead of the full matrix, reducing memory from O(n^2) to O(mn) - making it tractable for n=10^6 variables. PyTorch's LBFGS optimizer implements this.

Why is L-BFGS preferred over BFGS for large-scale optimization (n > 10,000)?

Constrained Optimization

Most engineering optimization has constraints: minimize fuel consumption subject to safety margins; minimize portfolio risk subject to a return target; find the shortest path subject to avoiding obstacles. Constrained optimization adds equality constraints `g(x) = 0` and inequality constraints `h(x) ≤ 0`. The **KKT conditions** (Karush-Kuhn-Tucker) characterize optimal points; methods like Sequential Quadratic Programming (SQP) and interior point methods solve the KKT system numerically.

**Lagrangian for constrained optimization:** `L(x, λ, μ) = f(x) + λ^T g(x) + μ^T h(x)` **KKT conditions** (necessary for optimality): 1. Stationarity: ∇_x L = 0 2. Primal feasibility: g(x*) = 0, h(x*) ≤ 0 3. Dual feasibility: μ ≥ 0 4. Complementary slackness: μ_i · h_i(x*) = 0 for all i Active inequality constraints (h_i(x*) = 0) act like equality constraints at the optimum.

The complementary slackness KKT condition states μ_i · h_i(x*) = 0. What does this imply for inactive inequality constraints (h_i(x*) < 0)?

Global Optimization

Newton and BFGS find **local** minima - the nearest valley from the starting point. For non-convex problems with many local minima (protein folding, hyperparameter optimization, circuit layout), the global minimum may be far from any starting point. Global optimization is NP-hard in general but practical methods exist: **simulated annealing** (probabilistic hill-climbing), **genetic algorithms** (population-based search), **Bayesian optimization** (smart sequential search using Gaussian process surrogate), and **basin-hopping** (local optimization + random perturbation).

MethodMechanismStrengthsWeaknesses
Simulated AnnealingProbabilistic acceptance of worse solutionsSimple, general, theoretical guaranteesSlow cooling required; many evaluations
Genetic AlgorithmPopulation + crossover + mutation + selectionParallel search; handles discrete variablesMany hyperparameters; no gradient use
Bayesian OptimizationGP surrogate + acquisition functionVery sample-efficient (10-100 evals)Scales poorly past ~20 dimensions
Basin-HoppingLocal opt + perturbation, accept if global improvesUses local optimizers; fastProblem-specific perturbation design
DIRECTDividing rectangles, deterministicNo hyperparameters; guaranteed coverageScales poorly to high dimensions

Newton's method is always better than gradient descent because it converges faster

Newton's method is better for smooth, moderate-scale problems; gradient descent (SGD, Adam) is preferred for large-scale machine learning because computing and storing the Hessian is infeasible for millions of parameters

For a neural network with n=10 million parameters, the Hessian is a 10^14-element matrix - impossible to store or invert. Stochastic gradient descent's noise actually helps escape local minima and sharp basins in non-convex deep learning landscapes. Adam approximates diagonal curvature (per-parameter learning rate) at O(n) memory cost. Newton's quadratic convergence advantage only matters when you can afford the Hessian; for deep learning, you cannot.

Bayesian optimization is particularly efficient for hyperparameter tuning because:

Key Ideas

  • **Newton's method** uses the Hessian (second derivatives) to take exact quadratic steps; quadratic convergence (doubles correct digits each step); O(n^3) cost per iteration makes it impractical for n > 10,000
  • **BFGS/L-BFGS** approximate the inverse Hessian from gradient differences; L-BFGS stores only m recent (s,y) pairs, reducing memory from O(n^2) to O(mn); superlinear convergence with no Hessian evaluation
  • **Constrained optimization** (KKT conditions): equality constraints via Lagrange multipliers, inequality constraints via dual variables with μ_i ≥ 0; SQP and interior point methods solve KKT systems numerically
  • **Global optimization** methods (Bayesian optimization, simulated annealing, genetic algorithms) trade convergence guarantees for wider search; Bayesian optimization is sample-efficient (50 evals) for hyperparameter tuning vs grid search (10,000+)

Related Topics

Numerical optimization connects scientific computing to machine learning and engineering:

  • Monte Carlo Methods — Stochastic optimization (SGD, simulated annealing) uses randomness similarly to Monte Carlo; both trade determinism for scalability
  • Numerical Linear Algebra — Newton's method requires solving linear systems Hd = -g; sparse Hessian structure enables efficient solvers for large-scale problems
  • Machine Learning Optimization (Adam, SGD) — Adam is a quasi-Newton variant with diagonal Hessian approximation; understanding Newton's method explains why Adam's per-parameter learning rates work

Вопросы для размышления

  • Newton's method computes the full Hessian at each step. For a neural network with 175 billion parameters (GPT-3 scale), what is the memory requirement for the Hessian matrix in bytes? Why does this make Newton's method impractical, and what does Adam approximate instead?
  • Bayesian optimization with a Gaussian process surrogate scales as O(n^3) in the number of evaluations n. What does this imply about its practical limit - and how do sparse GP approximations address this?
  • The KKT complementary slackness condition is μ_i · h_i(x*) = 0. Explain geometrically why active inequality constraints (h_i = 0) can have non-zero multipliers while inactive constraints (h_i < 0) must have μ_i = 0.

Связанные уроки

  • nm-05
Numerical Optimization

0

1

Sign In