Convex Optimization

Stochastic Optimization and Adaptive Methods

How can a neural network be trained on a billion examples when a single pass over all data takes hours and the full gradient does not fit in memory?

  • GPT-4: trained on 25,000 GPUs for 90 days with 10^23 operations - only Adam makes this scale feasible through per-coordinate adaptive step sizes
  • YouTube recommendations: the ranking model updates online on a stream of 500 million new views per day via mini-batch SGD
  • Financial HFT: stochastic approximation estimates risk parameters in real time from a stream of exchange ticks
  • Medical federated learning: training a cancer-detection model across 50 hospitals via local SGD without sharing patients

Предварительные знания

  • Gradient descent
  • Convex functions and L-smoothness
  • Probability theory
  • Gradient Descent Methods
  • Convex Functions
  • Expectation and Variance

SGD Foundations: stochastic gradient

Stochastic gradient descent (Robbins-Monro 1951) is the foundation of modern deep learning. Instead of the exact gradient F(theta) = (1/N) * sum f_i(theta), each step uses an estimate on a random mini-batch. This reduces per-iteration cost from O(N) to O(1) (or O(batch)), enabling training of models with millions of parameters on petabyte-scale data.

Mini-batch SGD is standard practice: instead of a single term the gradient averages B observations. This shrinks the noise variance by a factor B, supports parallel GPU computation, and keeps the per-iteration cost reasonable. Typical B ranges from 32 to 1024 depending on model and hardware.

In the nonconvex setting (typical for deep learning) SGD converges only to a stationary point, not necessarily a global optimum. Empirically the stochasticity helps escape poor local minima and saddle points - one reason SGD succeeds in deep learning.

Why is SGD preferred over full gradient descent when N is large?

Variance Reduction Methods

SGD's main limitation is gradient noise that slows convergence. Variance reduction methods (SVRG, SAGA, Katyusha) combine the stochastic advantage with deterministic-style rates. The idea is to periodically refresh a full-gradient reference point and use it as a control variate that shrinks the stochastic estimator's variance.

Katyusha (Allen-Zhu 2017) is an accelerated VR method with optimal complexity O((n + sqrt(n * kappa)) * log(1/eps)) for strongly convex problems. It hits the theoretical lower bound for first-order stochastic methods.

What is the main advantage of VR methods over plain SGD?

Adaptive Methods: AdaGrad, Adam, AdamW

Adaptive methods auto-tune per-coordinate step sizes from the gradient history. AdaGrad (Duchi 2011) introduced the idea; RMSProp (Hinton 2012) and Adam (Kingma 2014) refined it and became the deep-learning default. AdamW (Loshchilov 2017) fixed its interaction with regularization. These methods dominate transformer training - GPT, BERT, PaLM all use Adam variants.

AdamW decouples L2 regularization from the gradient step: instead of g_k + lambda*theta the update becomes theta_{k+1} = (1 - lambda*alpha)*theta_k - alpha*update. This fixes Adam's pathology of unevenly scaling L2 through v_k and empirically improves generalization on large models.

Despite ubiquity, Adam does not converge on every convex problem (Reddi 2018) - counterexamples exist. AMSGrad fixes this by enforcing monotone v_k. In practice the issue is rare and vanilla Adam stays the default.

Modern optimizers (Lion, Sophia, Shampoo) extend Adam's ideas. Lion (2023) uses only the gradient sign and cuts memory in half with comparable performance. Sophia (2024) uses an approximate Hessian for step adaptation and delivers 2x speedup in LLM training. These methods are under active industry evaluation.

Which AdaGrad issue does Adam fix through the exponential moving average?

Key ideas

  • SGD replaces the full gradient with an unbiased mini-batch estimate; cost per iteration O(b) vs. O(n)
  • Robbins-Monro condition (sum of steps = inf, sum of squares < inf) guarantees a.s. convergence to the optimum
  • SGD rate: O(1/sqrt(K)) for convex problems - the optimal lower bound for first-order stochastic oracle methods
  • Adam adapts the step size per coordinate via first (m) and second (v) moment estimates with bias correction
  • Variance reduction (SVRG, SARAH) eliminates the O(1/sqrt(K)) barrier: linear convergence under strong convexity

Connections to Other Areas

Stochastic optimization bridges numerical methods, statistics, and probability theory.

  • Stochastic Optimisation: SGD and Its Improvements — Introductory chapter on SGD, momentum, and Adam
  • Online Convex Optimization — AdaGrad and Adam originated in the OCO literature
  • Large-Scale Optimisation: From Cluster to Supercomputer — Distributed SGD scales adaptive methods to thousands of GPUs

Вопросы для размышления

  • Why does doubling the batch size not halve the number of steps needed - large batch training does not scale linearly?
  • How is the warm-up learning rate schedule related to gradient instability at random initialization?
  • What is the fundamental difference between variance reduction methods and gradient clipping as ways to improve SGD?
Stochastic Optimization and Adaptive Methods

0

1

Sign In