Convex Optimization

Stochastic Optimisation: SGD and Its Improvements

Every time one train a neural network with Adam - one're using online learning theory and variance reduction. YouTube's recommendation system updates online: every click is one OGD step. The fundamental trade-off «convergence rate ↔ gradient cost» determines how fast the model learns.

Adam, AdamW, RMSProp: practical implementations of adaptive online methods
Federated learning: SVRG-like algorithms for training on distributed data
Recommendation systems: online updates via OGD on each click
Language models: AdamW + gradient clipping - the standard configuration

SGD: Convergence Theory and the Noise/Speed Trade-off

**Stochastic Gradient Descent (SGD)** is the workhorse algorithm in ML. At each step a stochastic gradient g_k ≈ ∇f(x_k) is computed from a random mini-batch. Gradient noise speeds up the early phase but hinders final convergence. For smooth convex functions the rate is O(1/√T).

**Mini-batch: noise/parallelism trade-off:** With a mini-batch of size B the gradient variance drops by a factor of B: σ²_batch ≈ σ²/B. At the same time the batch operation is computed in parallel on the GPU. The optimal B depends on the architecture: for modern GPUs B = 256 - 2048, not B = 1 (SGD) and not B = n (GD).

Why does SGD with a constant step size α not converge to the exact minimum of a convex function?

Variance Reduction: SVRG and SAGA

**Variance reduction methods** (SVRG, SAGA, SAG) achieve linear convergence O(exp(−T/(n+κ))) for finite sums at a cost close to SGD. The idea: store control-point gradients and correct the stochastic gradient so that its variance decays to zero.

**When to use SVRG/SAGA:** Variance reduction methods are beneficial for μ-strongly convex f and moderate n (so that the full gradient recomputation in SVRG is affordable). In deep learning the function is not strongly convex, so Adam/AdamW and classical SGD with momentum are more commonly used in practice.

How does SVRG reduce the variance of the stochastic gradient?

Online Learning and Regret Bounds

**Online learning** is a game between the algorithm and nature: at step t the algorithm picks x_t, nature reveals loss function f_t, the algorithm incurs loss f_t(x_t). **Regret** R_T = ∑f_t(x_t) − min_x ∑f_t(x) - the gap relative to the best fixed solution. Goal: R_T = o(T) (sublinear regret).

**AdaGrad → Adam:** AdaGrad accumulates all past squared gradients, which with long training makes the step too small. Adam (Kingma & Ba, 2014) uses an exponential moving average of squared gradients (RMSProp) + first-moment momentum. This fixes AdaGrad's accumulation problem and is the standard for deep learning.

What is regret in online learning and why is the goal sublinear regret?

Key Ideas

SGD: unbiased stochastic gradient, O(1/√T) or O(1/T) convergence; constant step → neighbourhood of minimum
n vs T trade-off: SGD wins for large n when moderate accuracy is needed
SVRG: full gradient as anchor + correction → linear convergence at ≈ SGD cost
SAGA: gradient table instead of anchor, no epochs, better for online updates
Regret O(√T): OGD is optimal for online learning; AdaGrad adapts to sparse features

Вопросы для размышления

Why is Adam used for deep neural networks rather than SVRG/SAGA, even though the latter have better theoretical guarantees?
How does learning rate warmup (gradually increasing α at the start of training) relate to variance reduction theory?
Construct an example where OGD has O(√T) regret but a greedy algorithm has O(T) regret. What is essential about this example?

Связанные уроки

prob-13-clt

SGD: Convergence Theory and the Noise/Speed Trade-off

Why does SGD with a constant step size α not converge to the exact minimum of a convex function?

Variance Reduction: SVRG and SAGA

How does SVRG reduce the variance of the stochastic gradient?

Online Learning and Regret Bounds

What is regret in online learning and why is the goal sublinear regret?

Key Ideas

SGD: unbiased stochastic gradient, O(1/√T) or O(1/T) convergence; constant step → neighbourhood of minimum

n vs T trade-off: SGD wins for large n when moderate accuracy is needed

SVRG: full gradient as anchor + correction → linear convergence at ≈ SGD cost

SAGA: gradient table instead of anchor, no epochs, better for online updates

Regret O(√T): OGD is optimal for online learning; AdaGrad adapts to sparse features

Stochastic Optimisation: SGD and Its Improvements

SGD: Convergence Theory and the Noise/Speed Trade-off

Variance Reduction: SVRG and SAGA

Online Learning and Regret Bounds

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Stochastic Optimisation: SGD and Its Improvements

SGD: Convergence Theory and the Noise/Speed Trade-off

Variance Reduction: SVRG and SAGA

Online Learning and Regret Bounds

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки