Optimizers: SGD, Adam, RMSProp
The same neural network with the same architecture can train in 10 minutes or never converge at all. The difference is the choice of optimizer and learning rate. The Google Brain team spent millions of GPU hours on experiments and found that switching from Adam to SGD with Momentum changes training time by 3-5x for large language models. With a budget of tens of millions of dollars per training run, the right optimizer is not an academic question - it's a decision worth millions. The sections ahead cover how the major optimizers work and when to use each.
- **Training GPT and LLMs** - OpenAI, Google, Meta use AdamW with warmup and cosine scheduling to train models costing tens of millions of dollars, and the wrong lr schedule can double the budget
- **Computer Vision in production** - Tesla Autopilot, medical diagnostics, industrial quality control: SGD + Momentum with cosine annealing remains a common choice because it often generalizes better on real data
- **Fine-tuning pretrained models** - adapting BERT, GPT, Stable Diffusion to a specific task requires a tiny lr (2e-5) with warmup, otherwise the model forgets its pretrained knowledge (catastrophic forgetting)
Half a century of ideas converging in Adam
Modern optimizers are layers of ideas accumulated over half a century. In 1964 Soviet mathematician Boris Polyak proposed momentum (the heavy ball method), which smooths the steps of gradient descent through inertia. In 1983 Yurii Nesterov improved it with an accelerated gradient that "looks ahead". In 2011 John Duchi and coauthors introduced AdaGrad, the first adaptive method with a per-parameter learning rate. Around 2012 Geoffrey Hinton proposed RMSprop in his Coursera course, fixing the decaying step size of AdaGrad. Finally, in 2014 Diederik Kingma and Jimmy Ba combined momentum and an adaptive step in Adam, which became the default optimizer for most deep networks. Each idea fixed a weakness of the previous one, and together they merged into the industry standard.
Prerequisites
Momentum - physical intuition
Vanilla SGD (Stochastic Gradient Descent) updates weights strictly in the direction of the gradient at the current step. The problem is that the loss surface is rarely smooth and round. More often it is elongated, like a narrow canyon: the loss changes slowly along one axis and sharply along the other. SGD starts **oscillating** across the canyon (where the gradient is large) while barely moving along it (where the gradient is small). The result - training proceeds painfully slowly.
**Momentum** solves this problem by analogy with physics: imagine a ball rolling down a surface. The ball doesn't just follow the slope at each point - it **accumulates velocity**. If the slope consistently points in one direction, the velocity grows. If the slope reverses direction (oscillations), velocities in opposite directions cancel each other out. The result - the ball moves quickly along the canyon without oscillating across it.
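A minimal sketch of the canyon picture, under stated assumptions: the loss is a hypothetical quadratic `0.5 * (x^2 + 50 * y^2)`, gradients are exact rather than stochastic, and the velocity update `v = beta * v + g` is one common convention (the text does not fix a formula). Setting `beta = 0` recovers plain gradient descent. The names `grad`, `loss`, and `run` are illustrative.

```python
def grad(w, a=1.0, b=50.0):
    """Gradient of the toy 'canyon' loss 0.5 * (a*x^2 + b*y^2)."""
    x, y = w
    return (a * x, b * y)


def loss(w, a=1.0, b=50.0):
    x, y = w
    return 0.5 * (a * x * x + b * y * y)


def run(beta, lr=0.03, steps=100, start=(10.0, 1.0)):
    """Heavy-ball update: v = beta*v + g, then w = w - lr*v.

    beta = 0 gives plain gradient descent with the same lr.
    """
    w = list(start)
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        for i in range(2):
            v[i] = beta * v[i] + g[i]   # accumulate velocity
            w[i] -= lr * v[i]           # move along the velocity
    return loss(w)


loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"plain: {loss_plain:.5f}  momentum: {loss_momentum:.5f}")
```

With these illustrative settings the momentum run ends at a noticeably lower loss after the same number of steps, because progress along the flat x direction accumulates. The exact numbers depend on the chosen lr and curvatures.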
**Why beta = 0.9?** With beta = 0.9, momentum "remembers" approximately the last 10 gradients (1 / (1 - 0.9) = 10). Gradients pointing in the same direction accumulate and accelerate movement. Gradients in opposite directions (oscillations) cancel each other out.
- **beta = 0** - no momentum, plain SGD
- **beta = 0.9** - standard, works well almost always
- **beta = 0.99** - strong momentum, slower to react to changes
- **beta -> 1** - almost never forgets the past, can overshoot the minimum
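The "last 10 gradients" rule of thumb can be checked numerically. The sketch below assumes the velocity is built as `v = beta * v + g`, so the gradient from k steps ago carries weight `beta^k`; the helper name `ema_weights` is hypothetical.

```python
def ema_weights(beta, n):
    """Weight beta^k that the velocity gives to the gradient from k steps ago."""
    return [beta ** k for k in range(n)]


weights = ema_weights(0.9, 200)
total = sum(weights)                      # approaches 1 / (1 - beta)
recent_share = sum(weights[:10]) / total  # share carried by the 10 newest gradients
print(round(total, 3), round(recent_share, 3))
```

The total weight approaches 1 / (1 - beta) = 10, and the ten most recent gradients carry roughly two thirds of it, so "remembers the last 10" is an effective window rather than a hard cutoff.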
**Nesterov Momentum** is an improvement over vanilla momentum. The idea: before computing the gradient, take a preliminary step along the current velocity ("look ahead"). If momentum is carrying us in the wrong direction, Nesterov corrects course earlier. In practice, Nesterov often provides a small but consistent improvement in convergence, so enabling it (for example `nesterov=True` in PyTorch's `torch.optim.SGD`) is a reasonable default when using SGD + Momentum.
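The look-ahead idea can be sketched in a few lines. This is the textbook formulation on a hypothetical one-dimensional loss `0.5 * w^2`; library implementations may use an algebraically rearranged form, so treat `nesterov_step` as an illustration rather than any framework's exact update.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov step: the gradient is evaluated at the look-ahead point."""
    g = grad_fn(w + beta * v)   # gradient where the velocity is about to carry us
    v = beta * v - lr * g       # update velocity using the look-ahead gradient
    return w + v, v


def grad_fn(w):
    """Gradient of the toy loss 0.5 * w^2 (an assumption for this sketch)."""
    return w


w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn)
print(abs(w))
```

The only difference from classical momentum is the argument passed to `grad_fn`: `w + beta * v` instead of `w`.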
Why does SGD with Momentum converge faster than vanilla SGD on elongated (ill-conditioned) loss surfaces?
RMSProp - adaptive learning rate
Momentum addresses oscillations, but has a limitation: **one learning rate for all parameters**. What if some parameters receive large gradients (and need cautious updates) while others receive small ones (and can be updated more boldly)? Consider word embeddings: the embedding of "the" is updated in every batch and accumulates large gradients, while "quasar" appears once every thousand batches, so its embedding receives a gradient only rarely. One lr doesn't fit both.
**RMSProp** (Root Mean Square Propagation) adapts the learning rate for each parameter individually. The idea: track a **running average of squared gradients** for each parameter. If a parameter regularly receives large gradients, its lr decreases. If gradients are small - lr increases. Result: all parameters are updated at roughly the same "speed" regardless of gradient magnitude.
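A minimal sketch of the per-parameter scaling, for a single scalar parameter. The hyperparameter values (`lr=0.001`, `decay=0.9`, `eps=1e-8`) and the helper names are assumptions for illustration; the same constant gradient is fed repeatedly to make the effect easy to see.

```python
import math


def rmsprop_step(w, s, g, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp step for a single scalar parameter."""
    s = decay * s + (1 - decay) * g * g    # running average of squared gradients
    step = lr * g / (math.sqrt(s) + eps)   # gradient divided by its recent RMS
    return w - step, s, step


def last_step_size(g, n_steps=100):
    """Feed the same gradient g repeatedly and return the size of the final step."""
    w, s, step = 0.0, 0.0, 0.0
    for _ in range(n_steps):
        w, s, step = rmsprop_step(w, s, g)
    return abs(step)


step_large = last_step_size(g=100.0)
step_small = last_step_size(g=0.01)
print(step_large, step_small)
```

Although the two gradients differ by four orders of magnitude, both parameters end up moving by roughly `lr` per step, which is the "same speed" behavior described above.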
**Historical note:** RMSProp was proposed by **Geoffrey Hinton** in a Coursera lecture in 2012. It was **never formally published** as a research paper - just a slide in an online course. Despite this, RMSProp became one of the most popular optimizers, used by millions of practitioners. This is a rare case where an important algorithm spread purely through oral tradition and informal references. The predecessor of RMSProp is **AdaGrad** (2011). AdaGrad's problem: it accumulates ALL squared gradients from the beginning of training, and lr monotonically decays to zero. RMSProp fixes this with a running average - old gradients are "forgotten".
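The difference between the two accumulators can be illustrated with a constant gradient. This is a sketch with hypothetical values (`g=1.0`, `lr=0.01`, `decay=0.9`); it compares only the effective step scale `lr / sqrt(accumulator)`, not full training runs.

```python
import math


def effective_lr(n_steps, g=1.0, lr=0.01, decay=0.9, eps=1e-8):
    """Return (AdaGrad, RMSProp) effective learning rates after n_steps."""
    acc_adagrad, acc_rmsprop = 0.0, 0.0
    for _ in range(n_steps):
        acc_adagrad += g * g                                     # sum of all squared gradients
        acc_rmsprop = decay * acc_rmsprop + (1 - decay) * g * g  # running average
    return (lr / (math.sqrt(acc_adagrad) + eps),
            lr / (math.sqrt(acc_rmsprop) + eps))


for n in (1, 100, 10000):
    ada, rms = effective_lr(n)
    print(n, ada, rms)
```

The AdaGrad value keeps shrinking like `lr / sqrt(n)`, while the RMSProp value settles near `lr` once the running average has warmed up.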
**Learning rate for adaptive optimizers:** Adaptive methods already scale the lr for each parameter, so the base lr should be smaller. Using lr=0.01 with RMSProp may cause training to diverge.
- Typical lr for SGD: 0.01 to 0.1
- Typical lr for RMSProp/Adam: 0.0001 to 0.001
What does RMSProp do to a parameter that consistently receives very large gradients?
Adam - the best of both worlds
Adam (Adaptive Moment Estimation) combines the ideas of **Momentum** and **RMSProp** in a single optimizer. Momentum accumulates the first moment (mean of gradients, direction), RMSProp the second moment (mean of squared gradients, scale). Adam tracks both: the first moment determines the **direction** of the update, the second moment determines the **step size** for each parameter. The result - fast convergence (Momentum) with an adaptive learning rate (RMSProp).
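The two moments can be put together in a short sketch for a single scalar parameter. The toy loss `(w - target)^2`, the `lr=0.1` used in `minimize`, and the helper names are assumptions chosen to keep the example small; the beta and eps defaults follow the commonly quoted values.

```python
import math


def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def minimize(n_steps, lr=0.1, target=3.0):
    """Minimize the toy loss (w - target)^2 starting from w = 0."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = 2.0 * (w - target)
        w, m, v = adam_step(w, m, v, g, t, lr=lr)
    return w


print(minimize(1))     # the first step has size close to lr, whatever the gradient scale
print(minimize(1000))  # ends up close to the target
```

Note how the first moment sets the sign and smoothness of the update, while dividing by the square root of the second moment removes the raw gradient scale.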
**Why is Bias Correction needed?** At the start of training, m and v are initialized to zeros. On the first step:

- m = 0.9 * 0 + 0.1 * g = 0.1 * g - heavily underestimated!
- v = 0.999 * 0 + 0.001 * g^2 = 0.001 * g^2 - even more underestimated!

Bias correction compensates for this:

- m_hat = m / (1 - 0.9^1) = 0.1*g / 0.1 = g - correct estimate
- v_hat = v / (1 - 0.999^1) = 0.001*g^2 / 0.001 = g^2 - correct

The correction fades as beta^t -> 0. For the first moment (beta1 = 0.9) it becomes negligible after a few dozen steps, but for the second moment (beta2 = 0.999) it stays significant for on the order of a thousand steps, because 0.999^t decays slowly. In both cases it matters most during the first steps, where it is critically important for stability.
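How quickly the correction fades can be computed directly. The sketch below only evaluates the multiplier `1 / (1 - beta^t)` for the two default betas; `correction_factor` is a hypothetical helper name.

```python
def correction_factor(beta, t):
    """Multiplier 1 / (1 - beta^t) applied to a zero-initialized running average."""
    return 1.0 / (1.0 - beta ** t)


for t in (1, 10, 100, 1000):
    first = correction_factor(0.9, t)     # first moment, beta1 = 0.9
    second = correction_factor(0.999, t)  # second moment, beta2 = 0.999
    print(t, round(first, 3), round(second, 3))
```

The first-moment factor drops toward 1 within tens of steps, while the second-moment factor is still well above 1 after a hundred steps, since 0.999^t shrinks much more slowly than 0.9^t.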
Why did Adam become the **default optimizer** for most tasks? It barely requires hyperparameter tuning: beta1=0.9, beta2=0.999, lr=0.001 work surprisingly well across a wide range of architectures and tasks. Adam is especially good for: sparse gradients (NLP, embeddings), noisy gradients (small batches), non-stationary tasks (reinforcement learning). If you don't know which optimizer to choose - start with Adam.
**AdamW vs Adam:** In vanilla Adam, weight decay (L2 regularization) is applied to the gradient before adaptation, which weakens its effect for parameters with large gradients. **AdamW** (decoupled weight decay) applies weight decay directly to the weights, independent of adaptation. This gives more uniform regularization. AdamW became the standard for training Transformers - BERT, GPT, Vision Transformer all use AdamW. If you're working with Transformer architectures, use AdamW.
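The difference can be seen in a deliberately simplified snapshot: the very first update, where bias correction gives `m_hat = g` and `v_hat = g^2`. All function names and the values `lr=0.001`, `wd=0.01` are assumptions for illustration; over many steps the effect is less extreme than in this single-step view.

```python
import math


def adam_direction(g, eps=1e-8):
    """Bias-corrected Adam direction on the first step: g / (sqrt(g^2) + eps)."""
    return g / (math.sqrt(g * g) + eps)


def step_adam_l2(w, g, lr=0.001, wd=0.01):
    """Adam with L2: wd * w is added to the gradient, then normalized with it."""
    return w - lr * adam_direction(g + wd * w)


def step_adamw(w, g, lr=0.001, wd=0.01):
    """AdamW: the decay shrinks the weight directly, outside the adaptive scaling."""
    return w - lr * wd * w - lr * adam_direction(g)


def decay_effect(step_fn, w, g):
    """Part of the first update that is attributable to weight decay."""
    return step_fn(w, g, wd=0.0) - step_fn(w, g, wd=0.01)


for g in (0.1, 10.0):
    print(g, decay_effect(step_adam_l2, 1.0, g), decay_effect(step_adamw, 1.0, g))
```

In this snapshot the L2 term is almost completely absorbed by the normalization, whereas the decoupled decay removes the same `lr * wd * w` from the weight regardless of the gradient size.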
What does Adam track in addition to the running average of squared gradients (second moment)?
Learning Rate Scheduling
Even with adaptive optimizers like Adam, the **base learning rate** remains the most important hyperparameter. A constant lr is a compromise: a large lr allows fast exploration of the parameter space early on, but hinders fine-tuning at the end (the model "jumps" around the minimum). A small lr gives precise fine-tuning but makes the beginning of training take forever. The solution - **change lr during training**: start large and gradually decrease.
- **Step Decay** - divide lr by a factor of N every K epochs (e.g., lr /= 10 every 30 epochs). Simple and effective, standard for ResNet
- **Exponential Decay** - lr = lr_0 * gamma^epoch (e.g., gamma=0.95). Smooth decrease, but lr may become too small too early
- **Cosine Annealing** - lr follows a cosine curve from the initial value down to a minimum. Smoother than step decay, popular for state-of-the-art models
- **Warmup** - lr starts very small and linearly increases to the target over the first N steps. Critically important for Transformers: without warmup, training often diverges
- **One-Cycle Policy** (Leslie Smith) - lr first increases from a low starting value to max, then anneals to well below the starting value. One of the best strategies in terms of simplicity-to-result ratio
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Step Decay | CNN (ResNet, VGG) | Simplicity, predictability | Abrupt lr jumps |
| Cosine Annealing | Most tasks | Smooth decay | Need to know T_max |
| Warmup + Cosine | Transformers (BERT, GPT) | Stability + smoothness | Two hyperparameters |
| One-Cycle | Fast training | Often best result | Need max_lr |
| Exponential | Simple tasks | Single gamma parameter | lr drops too fast |
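Three of the schedules above written as plain functions of the epoch number. This is a sketch: the base lr of 0.1 and `t_max=90` are assumed values, while the step-decay and gamma settings mirror the examples in the list.

```python
import math


def step_decay(epoch, lr0=0.1, factor=0.1, every=30):
    """Multiply lr by `factor` once every `every` epochs."""
    return lr0 * factor ** (epoch // every)


def exponential_decay(epoch, lr0=0.1, gamma=0.95):
    """lr = lr0 * gamma^epoch."""
    return lr0 * gamma ** epoch


def cosine_annealing(epoch, lr0=0.1, lr_min=0.0, t_max=90):
    """Follow half a cosine period from lr0 at epoch 0 down to lr_min at t_max."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 30, 60, 90):
    print(epoch, step_decay(epoch), exponential_decay(epoch), cosine_annealing(epoch))
```

Printing the three columns side by side shows the trade-offs from the table: step decay jumps, exponential decay shrinks quickly, and cosine annealing needs the total length `t_max` up front.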
**Warmup - why is it needed for Transformers?** In the first steps of training, weights are random, so gradients are unstable and can be very large. If you immediately apply a large lr, updates will be enormous and the model will "explode" (diverge). Warmup starts with a tiny lr and linearly increases it over the first 5-10% of training. During this time the weights acquire reasonable values, gradients stabilize, and the model is ready to train with the full lr. Standard for Adam + Transformer: warmup for 5-10% of steps, then cosine decay.
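The warmup-then-cosine recipe can be sketched as a single function of the step index. The peak lr of `3e-4`, the 5% warmup fraction, and the function name `warmup_cosine` are illustrative assumptions, not values prescribed by any particular model.

```python
import math


def warmup_cosine(step, total_steps, peak_lr=3e-4, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


total = 1000
for step in (0, 10, 49, 500, 999):
    print(step, warmup_cosine(step, total))
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step; most frameworks also ship ready-made schedulers for this.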
**Myth:** Adam is always better than SGD, so SGD is outdated and not worth using
**Reality:** SGD with Momentum and proper learning rate scheduling often generalizes better than Adam in computer vision tasks, achieving higher accuracy on test data
Some studies suggest that Adam tends to find sharper minima on the loss surface, which generalize worse to new data. SGD + Momentum with cosine scheduling tends to find flatter minima, where the loss changes little under small perturbations of the weights, so the shift between training and test data hurts less. That is why classic computer vision models (ResNet, VGG) are typically trained with SGD + Momentum, while Transformers (BERT, GPT) use AdamW. The choice of optimizer depends on the architecture and task.
Why is learning rate warmup critically important for training Transformer models?
Summary
- **Momentum** accumulates "velocity" in weight updates: gradients in the same direction sum up and accelerate convergence, while oscillations in opposite directions cancel each other out. Beta=0.9 means inertia over the last ~10 steps
- **RMSProp** adapts the learning rate for each parameter individually: parameters with large gradients get a small lr, those with small gradients get a large lr. This balances the update speed of all parameters
- **Adam** combines Momentum (first moment - direction) and RMSProp (second moment - scale), adding bias correction for the first steps. Default parameters (lr=0.001, beta1=0.9, beta2=0.999) work for most tasks
- **Learning rate scheduling** changes lr during training: warmup stabilizes the start, cosine annealing or step decay ensures precise convergence at the end. Warmup is mandatory for Transformers
- **There is no universally best optimizer:** Adam - for prototypes, AdamW - for Transformers and fine-tuning, SGD + Momentum - for computer vision. The choice of optimizer and lr can separate a 10-minute training run from one that never converges, and at Google Brain scale it can be worth millions of dollars in GPU time
Related topics
Optimizers are the central element of training any neural network, connecting the theory of gradient descent to the practice of deep learning:
- Gradient Descent — The foundation of all optimizers: SGD, Momentum, RMSProp, Adam are extensions of basic gradient descent. Understanding the loss surface landscape and the problem of local minima is critical for choosing an optimizer
- CNN (Convolutional Neural Networks) — For training CNNs the standard remains SGD + Momentum with cosine scheduling, not Adam. Architecture affects optimizer choice: convolutional layers generalize better with SGD, attention layers with AdamW
- Transformers — Transformers require AdamW with warmup + cosine decay. Without warmup, the attention mechanism generates unstable gradients on random weights and training diverges
Questions for reflection
- Why do Adam's default parameters work well for most tasks, yet SGD + Momentum often gives better generalization in computer vision? What property of loss surface minima might explain this?
- If you were training a model on a dataset where 99% of words are high-frequency (the, is, a) and 1% are rare specialized terms, which optimizer would you choose and why?
- Warmup is critically important for Transformers, but not needed for simple CNNs. What about the Transformer architecture makes the beginning of training especially unstable?
Related lessons
- ml-09-gradient-descent — Optimizers extend plain gradient descent
- ml-26-backpropagation — They consume the gradients backprop produces
- ml-43-hyperparameters — Learning rate and betas are key hyperparameters
- calc-19-gradient — Every step moves along the loss gradient
- opt-01 — Adam and SGD are first-order optimization methods
- alg-20-greedy