Convex Optimization
Theoretical Convergence Guarantees
Why do we trust that a neural network has trained? Why is Nesterov faster than GD, and this is not empiricism but mathematical fact? Why does Google use Adagrad instead of SGD? Convergence theory is not dry mathematics, it is **reliability guarantees** and a **roadmap** for building faster algorithms.
- **TensorFlow and PyTorch**: choosing an optimizer (Adam, SGD, Adagrad) is applied convergence theory; Adam is justified as an approximate natural gradient with adaptive κ per parameter
- **LLM training**: GPT-4 is trained with a cosine lr schedule, not empiricism but a consequence of theory: decaying lr is necessary for SGD convergence; cosine approximates the optimal 1/k decay
- **Certifying ML systems**: in autonomous vehicles and medical devices, guarantees are required, convergence theorems give mathematically rigorous statements about algorithm behavior
Convergence Rates: GD, Nesterov and Strongly Convex
On L-smooth convex problems gradient descent gives an O(1/k) gap; Nesterov's 1983 accelerated method gives O(1/k²). For training a 175-billion-parameter model, that quadratic gap is the difference between 1 epoch and 1000. Convex optimization theory classifies algorithms by **convergence rate**: how quickly the gap f(xₖ) - f* decreases. The rate depends on the function class: convex, strongly convex, smooth (L-Lipschitz gradient) or not. Understanding these classes explains why Nesterov beats GD and why Adam outperforms SGD in practice.
Distinguish between **sublinear** O(1/k) and O(1/k²) convergence and **linear** e^{-αk} convergence. Linear means accuracy doubles every fixed number of iterations, this is exponentially fast in real time. For strongly convex functions, both methods (GD and Nesterov) give linear convergence, but with different constants: κ vs √κ.
The Polyak-Łojasiewicz (PL) condition: ½‖∇f(x)‖² ≥ μ(f(x) - f*). This is weaker than strong convexity, yet sufficient for linear convergence of GD! A function may be non-convex but satisfy PL, for example, softmax cross-entropy for sufficiently wide networks. This explains why GD converges linearly on overparameterized neural networks, even though the problem is non-convex.
Why does Nesterov's method give √κ-fold speedup over GD for strongly convex functions, not κ-fold?
Complexity Lower Bounds: Nemirovsky-Yudin Theorem
Knowing the convergence rates of specific algorithms gives upper bounds. But how fast can the **best possible** first-order algorithm be? **Complexity lower bounds** answer this: no first-order method can converge faster than a certain limit on the worst case.
The Nemirovsky-Yudin lower bounds apply to **black-box** first-order methods, algorithms that only know f(x) and ∇f(x) at queried points. If the problem has **additional structure** (sparsity, separability, a special graph), methods faster than the lower bound exist. Examples: coordinate descent for separable problems, ADMM for structured problems. The lower bounds constrain 'blind' methods, not all algorithms.
Why is standard GD (rate O(1/k)) not optimal for convex problems?
SGD Convergence: Convex and Strongly Convex Cases
In machine learning we use **stochastic gradient descent (SGD)** instead of the full gradient. This introduces variance: gₜ = ∇fᵢₜ(xₜ), an unbiased but noisy estimate of ∇f(xₜ). SGD theory analyzes how this noise affects convergence speed and what can be done about it.
Paradoxically: in ML, stochastic SGD noise often **helps** rather than just hurts. Noise acts as implicit regularization: SGD avoids sharp local minima with poor generalization, preferring flat ones (flat minima). Empirically: models trained with SGD generalize better than those trained with full GD at the same training accuracy. This 'noise as regularizer' is an active research area.
SGD for strongly convex problems converges at O(1/k), while deterministic GD converges linearly (e^{-k/κ}). Why not just use variance reduction (SVRG) instead of GD?
Acceleration Theory: Momentum, Catalyst and Optimality
Nesterov's method (1983) is mysterious: why "overshoot" the update point? Why does it work? Three interpretations: momentum (inertia), polynomial approximation, and geometric (estimate sequence). Modern theory provides a unified framework explaining acceleration through lower bound matching.
Catalyst shows that **acceleration is an outer wrapper** applicable to any algorithm. This is similar to how Adam is not a new algorithm but SGD with adaptive scaling. Modern understanding: Nesterov's acceleration is the optimal way to 'accumulate' gradient information in black-box first-order methods. Nemirovsky-Yudin lower bounds + Nesterov upper bounds = the complete picture of optimality.
Catalyst framework accelerates SVRG from O(n + nκ) to O(n + √(nκ)·log(1/ε)). How is this possible if Nesterov's lower bound is already tight?
Key Ideas
- **Convergence rates**: GD = O(1/k) for convex, O(e^{-k/κ}) for strongly convex; Nesterov = O(1/k²) and O(e^{-k/√κ}), optimal
- **Nemirovsky-Yudin lower bounds**: Ω(1/k²) for convex, Ω(e^{-k/√κ}) for strongly convex, no black-box first-order method is faster
- **SGD convergence**: O(1/√k) for convex, O(1/k) for strongly convex; Variance Reduction (SVRG, SAGA) restores linear rate
- **Catalyst**: universal acceleration framework for any first-order method to reach optimal finite-sum complexity
Related Topics
Convergence theory is the foundation of the entire convex optimization course:
- Optimization on Manifolds — Riemannian GD converges at O(1/k) for geodesically convex functions, the Riemannian analogue of convergence theorems
- Optimization in Reinforcement Learning — TRPO and PPO have theoretical monotonic policy improvement guarantees, consequences of trust region and surrogate objective
- Federated Optimization — FedAvg and FedProx convergence theorems are direct consequences of SGD theory with heterogeneous loss functions
Вопросы для размышления
- Why do Nemirovsky-Yudin lower bounds apply only to black-box methods, and how does structural knowledge about a problem help circumvent them?
- If Nesterov's method is optimal, why does Adam outperform it in practice for neural network training?
- Can randomization fundamentally change complexity lower bounds, or are they identical for deterministic and randomized methods?