Optimization

Introduction to Optimization

GPT-4 doesn't "think". It descends a mountain in a space of 1.7 trillion dimensions. The Adam optimizer finds the bottom over 3 months on 25,000 H100 GPUs. Cost of one descent: 100 million dollars. Gradient descent was invented by Cauchy in 1847 - he was minimizing functions of one variable. 177 years later, the same idea trains GPT-4. UPS routes for 60,000 drivers. Airline revenue management. Markowitz portfolios. One mathematics, different scales.

  • **SGD / Adam in neural networks** - every trained model from GPT-4 to ResNet is the result of optimizing a loss function; Adam uses exponential moving moments of gradients
  • **UPS logistics** - the ORION system optimizes routes for 60,000 drivers daily; saves 100 million miles per year
  • **Airline revenue management** - airlines solve real-time pricing as an optimization problem, balancing load and revenue
  • **Markowitz portfolio** - hedge funds optimize portfolios balancing return and risk; this is a QP problem with constraints

Objective Function

GPT-4 doesn't "think". It descends a mountain in a space of 1.7 trillion dimensions. The Adam optimizer finds the bottom over 3 months on 25,000 H100 GPUs. The cost of one such descent: 100 million dollars. Gradient descent was invented by Cauchy in 1847 - he was minimizing functions of a single variable. 177 years later, the same idea trains GPT-4. **Optimization** is the search for input values at which an objective function takes its best value.

The **objective function** f(x) is what we want to minimize (min) or maximize (max). The argument x can be a number, a vector, or even a function. Any maximization problem can be converted to minimization: max f(x) = min(-f(x)).

DomainObjective FunctionWhat We Optimize
Machine LearningLoss function (MSE, Cross-Entropy)Minimize model error
UPS LogisticsRoute costMinimize delivery expenses
Finance (Markowitz)Portfolio riskMaximize return at fixed risk
EngineeringStructure weightMinimize material while preserving strength

The key point: optimization is not about "finding an answer" - it's about finding the **best** answer. A calculator solves an equation. An optimizer finds an extremum. The Adam optimizer running for 3 months on 25,000 GPUs is the same principle as minimizing a single-variable function from Cauchy's 1847 textbook.

A neural network trains by minimizing a loss function. What is the argument x of the objective function?

Constraints and Feasible Set

In the real world, optimization always happens within constraints. Budgets are finite, time is limited, the laws of physics cannot be circumvented. **Constraints** define the region in which a solution is searched - the **feasible set**. UPS optimizes routes for 60,000 drivers with one hard constraint: a driver works no more than 10 hours. That's constrained optimization at industrial scale.

**Equality constraints** h(x) = 0 define a surface in parameter space (e.g., "weights sum to 1"). **Inequality constraints** g(x) ≤ 0 define a region (e.g., "each stock weight between 0 and 1"). The intersection of all constraints is the **feasible set**.

The optimum of a constrained problem often lies **on the boundary** of the feasible set, not inside it. Therefore, simply setting the gradient to zero is not enough - special methods are needed (Lagrange multipliers, KKT conditions).

An **unconstrained** problem is a special case where the feasible set is the entire space. Gradient descent in ML usually solves exactly this kind of problem (though regularization can be viewed as a soft constraint).

A factory produces chairs (x₁) and tables (x₂). Constraint: x₁ + 2x₂ ≤ 100 (working hours). What type of constraint is this?

Local vs Global Optimum

A mountain landscape. A valley surrounded by uphill slopes on all sides is a **local minimum**. But somewhere beyond the mountains lies an even deeper valley - the **global minimum**. How is it possible to know whether the true bottom has been found without surveying the entire landscape? This is exactly the question every engineer faces when looking at a neural network's loss curve.

A **local minimum** is a point x* where f(x*) ≤ f(x) for all x in some neighborhood. A **global minimum** is a point x* where f(x*) ≤ f(x) for **all** feasible x. For an arbitrary function, finding the global minimum is an NP-hard problem.

The saving property is **convexity**. If a function is convex (a "bowl" shape), then any local minimum is automatically the global minimum. No traps, no deception. That is why convex optimization is the gold standard: the solution is guaranteed to be optimal. Linear regression with MSE is convex - Adam finds the global minimum. Neural networks are non-convex, and that distinction is fundamental.

Convexity and Machine Learning

Linear regression with MSE loss is a convex problem, so gradient descent is guaranteed to find the global minimum. Neural networks are non-convex, and engineers rely on heuristics (random initialization, Adam) hoping for a "good enough" local minimum. Research from 2014-2015 (Dauphin, Choromanska) showed: in high-dimensional spaces, bad local minima are rare. Most critical points are saddle points, not traps. Gradient descent with momentum can escape them.

Function f(x) = x⁴ - 2x² + 1. What can be said about its extrema?

Classification of Optimization Problems

Not all optimization problems are the same. The type of problem determines which solution method is applicable and how efficient it is. A route as continuous coordinates - one kind of problem. Choosing from a finite set of warehouses - an entirely different one. Airline revenue management - a third. Training GPT-4 with 1.7 trillion parameters - a fourth.

TypeObjective FunctionConstraintsComplexityExample
Linear (LP)Linear: c^TxLinear: Ax ≤ bPolynomialResource allocation
Quadratic (QP)Quadratic: x^TQx + c^TxLinearPolynomial (if Q ≥ 0)Markowitz portfolio
ConvexConvex functionConvex setPolynomialSVM, Ridge regression
NonlinearArbitrary smoothArbitraryProblem-dependentNeural network training
CombinatorialOver discrete setDiscreteOften NP-hardTSP, scheduling

**Key principle:** the more structure a problem has, the more efficient the method. LP is solvable in polynomial time. For a convex problem, gradient descent finds the global optimum. For nonlinear - only local. For combinatorial - optimality often must be sacrificed for speed.

In machine learning most problems are **nonlinear** (neural network training), but individual components can be convex (linear regression, SVM). Understanding the problem type determines the choice of algorithm and expectations about the result.

Why does deep learning work despite non-convexity?

For a long time, the non-convexity of neural network loss functions was considered a fatal problem. But research from 2014-2015 (Dauphin, Choromanska) showed: in high-dimensional spaces, bad local minima are rare. Most critical points are saddle points, not traps. Gradient descent with momentum is able to escape them. That is why the Adam optimizer works in practice - even though the theory offers no guarantees.

Gradient descent always finds the global minimum

Gradient descent finds the global minimum only for convex functions. For non-convex functions - only a local minimum.

Gradient descent follows the direction of steepest descent and stops when the gradient is close to zero. For a convex function such a point is unique and global. For a non-convex function, gradient descent gets stuck at the nearest local minimum or saddle point, unaware of better solutions elsewhere. Adam adds adaptive momentum, which helps escape saddle points - but provides no global optimality guarantee.

The Travelling Salesman Problem (TSP): find the shortest route through N cities. What type does it belong to?

Key Ideas

  • **Objective function** f(x) - what is minimized or maximized; x - the parameters being tuned; Cauchy 1847 → Adam optimizer, 1.7 trillion parameters
  • **Constraints** (equalities and inequalities) define the feasible set - the search region; the optimum often lies on the boundary
  • **Convexity** - the key property: for convex functions local minimum = global minimum, gradient descent guaranteed to work; neural networks are non-convex, but that's manageable in high dimensions
  • The problem type (LP, QP, convex, nonlinear, combinatorial) determines the choice of method and result guarantees

Related Topics

Optimization is the foundation for many disciplines:

  • Gradient Descent — The main algorithm for solving continuous optimization problems
  • Linear Programming — The most important special case with polynomial algorithms
  • Machine Learning — Training a model = optimizing a loss function

Вопросы для размышления

  • Why are engineers satisfied with a local minimum in deep learning, even though the global minimum is theoretically better?
  • Can any maximization problem be reduced to minimization? Can a constrained problem be reduced to an unconstrained one?
  • If all optimization problems were convex, how would that change computer science?

Связанные уроки

  • opt-02 — Gradient descent is the core algorithm for solving continuous optimization problems - the natural continuation.
  • ml-01-intro — ML training is direct optimization: loss function minimization via gradient descent.
  • calc-06-derivative-intro — The derivative is the mathematical foundation of gradient descent; f'(x)=0 identifies critical points.
  • sci-01 — Numerical methods and optimization are tightly coupled: AlphaFold2 applies gradient descent in protein chemical configuration space.
  • prob-01-intro — Stochastic optimization (SGD) uses random samples - the same idea as Monte Carlo: randomness as a way to explore a large space efficiently.
  • calc-19-gradient
Introduction to Optimization

0

1

Sign In