Convex Optimization

Stochastic Optimisation: SGD and Its Improvements

Every time one train a neural network with Adam - one're using online learning theory and variance reduction. YouTube's recommendation system updates online: every click is one OGD step. The fundamental trade-off «convergence rate ↔ gradient cost» determines how fast the model learns.

  • Adam, AdamW, RMSProp: practical implementations of adaptive online methods
  • Federated learning: SVRG-like algorithms for training on distributed data
  • Recommendation systems: online updates via OGD on each click
  • Language models: AdamW + gradient clipping - the standard configuration

SGD: Convergence Theory and the Noise/Speed Trade-off

**Stochastic Gradient Descent (SGD)** is the workhorse algorithm in ML. At each step a stochastic gradient g_k ≈ ∇f(x_k) is computed from a random mini-batch. Gradient noise speeds up the early phase but hinders final convergence. For smooth convex functions the rate is O(1/√T).

**Mini-batch: noise/parallelism trade-off:** With a mini-batch of size B the gradient variance drops by a factor of B: σ²_batch ≈ σ²/B. At the same time the batch operation is computed in parallel on the GPU. The optimal B depends on the architecture: for modern GPUs B = 256 - 2048, not B = 1 (SGD) and not B = n (GD).

Why does SGD with a constant step size α not converge to the exact minimum of a convex function?

Variance Reduction: SVRG and SAGA

**Variance reduction methods** (SVRG, SAGA, SAG) achieve linear convergence O(exp(−T/(n+κ))) for finite sums at a cost close to SGD. The idea: store control-point gradients and correct the stochastic gradient so that its variance decays to zero.

**When to use SVRG/SAGA:** Variance reduction methods are beneficial for μ-strongly convex f and moderate n (so that the full gradient recomputation in SVRG is affordable). In deep learning the function is not strongly convex, so Adam/AdamW and classical SGD with momentum are more commonly used in practice.

How does SVRG reduce the variance of the stochastic gradient?

Online Learning and Regret Bounds

**Online learning** is a game between the algorithm and nature: at step t the algorithm picks x_t, nature reveals loss function f_t, the algorithm incurs loss f_t(x_t). **Regret** R_T = ∑f_t(x_t) − min_x ∑f_t(x) - the gap relative to the best fixed solution. Goal: R_T = o(T) (sublinear regret).

**AdaGrad → Adam:** AdaGrad accumulates all past squared gradients, which with long training makes the step too small. Adam (Kingma & Ba, 2014) uses an exponential moving average of squared gradients (RMSProp) + first-moment momentum. This fixes AdaGrad's accumulation problem and is the standard for deep learning.

What is regret in online learning and why is the goal sublinear regret?

Key Ideas

  • SGD: unbiased stochastic gradient, O(1/√T) or O(1/T) convergence; constant step → neighbourhood of minimum
  • n vs T trade-off: SGD wins for large n when moderate accuracy is needed
  • SVRG: full gradient as anchor + correction → linear convergence at ≈ SGD cost
  • SAGA: gradient table instead of anchor, no epochs, better for online updates
  • Regret O(√T): OGD is optimal for online learning; AdaGrad adapts to sparse features

Related Topics

Stochastic optimisation is the foundation of modern machine learning.

  • Proximal Methods — Stochastic ADMM, proximal SGD for constrained problems
  • Bayesian Optimisation — Bayesian opt. as an alternative to SGD for expensive loss functions
  • Second-Order Methods — L-BFGS vs SGD: the memory/convergence trade-off

Вопросы для размышления

  • Why is Adam used for deep neural networks rather than SVRG/SAGA, even though the latter have better theoretical guarantees?
  • How does learning rate warmup (gradually increasing α at the start of training) relate to variance reduction theory?
  • Construct an example where OGD has O(√T) regret but a greedy algorithm has O(T) regret. What is essential about this example?

Связанные уроки

  • prob-13-clt
Stochastic Optimisation: SGD and Its Improvements

0

1

Sign In