Convex Optimization
Stochastic Optimisation: SGD and Its Improvements
Every time one train a neural network with Adam - one're using online learning theory and variance reduction. YouTube's recommendation system updates online: every click is one OGD step. The fundamental trade-off «convergence rate ↔ gradient cost» determines how fast the model learns.
- Adam, AdamW, RMSProp: practical implementations of adaptive online methods
- Federated learning: SVRG-like algorithms for training on distributed data
- Recommendation systems: online updates via OGD on each click
- Language models: AdamW + gradient clipping - the standard configuration
SGD: Convergence Theory and the Noise/Speed Trade-off
**Stochastic Gradient Descent (SGD)** is the workhorse algorithm in ML. At each step a stochastic gradient g_k ≈ ∇f(x_k) is computed from a random mini-batch. Gradient noise speeds up the early phase but hinders final convergence. For smooth convex functions the rate is O(1/√T).
**Mini-batch: noise/parallelism trade-off:** With a mini-batch of size B the gradient variance drops by a factor of B: σ²_batch ≈ σ²/B. At the same time the batch operation is computed in parallel on the GPU. The optimal B depends on the architecture: for modern GPUs B = 256 - 2048, not B = 1 (SGD) and not B = n (GD).
Why does SGD with a constant step size α not converge to the exact minimum of a convex function?
Variance Reduction: SVRG and SAGA
**Variance reduction methods** (SVRG, SAGA, SAG) achieve linear convergence O(exp(−T/(n+κ))) for finite sums at a cost close to SGD. The idea: store control-point gradients and correct the stochastic gradient so that its variance decays to zero.
**When to use SVRG/SAGA:** Variance reduction methods are beneficial for μ-strongly convex f and moderate n (so that the full gradient recomputation in SVRG is affordable). In deep learning the function is not strongly convex, so Adam/AdamW and classical SGD with momentum are more commonly used in practice.
How does SVRG reduce the variance of the stochastic gradient?
Online Learning and Regret Bounds
**Online learning** is a game between the algorithm and nature: at step t the algorithm picks x_t, nature reveals loss function f_t, the algorithm incurs loss f_t(x_t). **Regret** R_T = ∑f_t(x_t) − min_x ∑f_t(x) - the gap relative to the best fixed solution. Goal: R_T = o(T) (sublinear regret).
**AdaGrad → Adam:** AdaGrad accumulates all past squared gradients, which with long training makes the step too small. Adam (Kingma & Ba, 2014) uses an exponential moving average of squared gradients (RMSProp) + first-moment momentum. This fixes AdaGrad's accumulation problem and is the standard for deep learning.
What is regret in online learning and why is the goal sublinear regret?
Key Ideas
- SGD: unbiased stochastic gradient, O(1/√T) or O(1/T) convergence; constant step → neighbourhood of minimum
- n vs T trade-off: SGD wins for large n when moderate accuracy is needed
- SVRG: full gradient as anchor + correction → linear convergence at ≈ SGD cost
- SAGA: gradient table instead of anchor, no epochs, better for online updates
- Regret O(√T): OGD is optimal for online learning; AdaGrad adapts to sparse features
Related Topics
Stochastic optimisation is the foundation of modern machine learning.
- Proximal Methods — Stochastic ADMM, proximal SGD for constrained problems
- Bayesian Optimisation — Bayesian opt. as an alternative to SGD for expensive loss functions
- Second-Order Methods — L-BFGS vs SGD: the memory/convergence trade-off
Вопросы для размышления
- Why is Adam used for deep neural networks rather than SVRG/SAGA, even though the latter have better theoretical guarantees?
- How does learning rate warmup (gradually increasing α at the start of training) relate to variance reduction theory?
- Construct an example where OGD has O(√T) regret but a greedy algorithm has O(T) regret. What is essential about this example?