Machine Learning
Gradient Descent
From Cauchy's slopes to stochastic learning
In 1847 the French mathematician Augustin-Louis Cauchy described the method of steepest descent: to find the minimum of a function, step repeatedly in the direction opposite to its gradient. He used it to solve systems of equations in astronomy, more than a century before computers existed. The second piece arrived in 1951, when Herbert Robbins and Sutton Monro published their work on stochastic approximation. They showed that convergence toward a minimum is possible using noisy, sampled estimates of the gradient rather than the exact one. That insight is the mathematical root of stochastic gradient descent, the workhorse that trains nearly every modern neural network.
When ChatGPT generates text and Midjourney draws pictures, inside there are neural networks with billions of parameters. But how does the computer find the right values for these billions of numbers? Not by brute force - with GPT-3's 175 billion parameters the number of combinations exceeds the number of atoms in the Universe. The answer: gradient descent - an algorithm that takes tiny steps toward the optimal solution, processing data in portions. And the size of those portions and how large the steps are determine whether a model trains in hours or never converges.
- **GPT-4** was trained for several months on thousands of GPUs - and all of that training amounts to billions of mini-batch gradient descent steps with a carefully tuned learning rate schedule
- **Tesla Autopilot** is continuously fine-tuned on new data from cameras in millions of vehicles - SGD allows the model to be updated incrementally without retraining from scratch each time
- **Choosing the learning rate** can cost a company millions of dollars: too large and training diverges, wasting weeks of GPU time; too small and the model fails to converge within the allocated budget
Предварительные знания
Batch Gradient Descent
In the linear regression lesson we learned that a model searches for parameters (weights) that minimize the loss function. But *how exactly* do you find the minimum? Picture standing on a mountain slope in thick fog. You can't see the summit, but you can feel the incline under your feet. The strategy is simple - step in the direction of the steepest descent. That is **gradient descent** - the main optimization algorithm in machine learning.
The **gradient** is a vector of partial derivatives of the loss function with respect to each parameter. It points in the direction of the *fastest increase* of the function. To move toward the minimum, we go *against* the gradient. The weight update formula at each step: **w = w - lr * gradient(J)**, where lr (learning rate) is the step size, and J is the loss function.
**Batch Gradient Descent** is the classic variant: at each step the gradient is computed over the **entire** dataset. The word "batch" here means "the whole bunch of data at once". The algorithm sums the errors over all examples, averages them, computes the direction and takes one step. Then repeats. For convex loss functions - e.g., MSE in linear regression - batch GD **guarantees** convergence to the global minimum.
**The main problem with batch GD is speed.** If the dataset has 10 million examples, every step requires processing all 10 million. The gradient is accurate and stable, but a single step may take minutes. And thousands of steps are needed. For large data (ImageNet - 14M images, GPT - hundreds of billions of tokens) pure batch GD is impractical.
**Iteration vs epoch:** - **Iteration** - one weight update step - **Epoch** - one full pass over the entire dataset In batch GD one iteration = one epoch (because each step uses all the data). In stochastic and mini-batch GD one epoch contains many iterations - this is the key distinction we'll cover next.
Why is batch gradient descent poorly suited for a dataset of 50 million examples?
Stochastic Gradient Descent (SGD)
What if instead of processing all 50 million examples at every step we update the weights after **each individual example**? That is **Stochastic Gradient Descent (SGD)** - instead of the exact gradient over the whole dataset we use an approximate gradient from a single random example. The word "stochastic" means "random" - hence the name.
Think of the difference like this: batch GD is like surveying every citizen in a city before every decision. SGD is like asking a single random passerby. One person's answer is inaccurate (noisy), but you get it instantly. In the time it takes for one accurate survey (batch) you can already ask thousands of passersby (SGD) and take thousands of steps. In the end SGD often reaches a good solution **faster**, despite the noise.
**Why is shuffling the data critically important?** If the data is ordered (e.g., all positive examples first, then negative), SGD will first "learn" one part, then abruptly switch to the other. The model will oscillate instead of converging. Shuffling before each epoch ensures that each step receives a random, representative example.
The noise of SGD has an unexpected **upside**: it helps escape **local minima**. Batch GD carefully slides into the nearest "valley" and gets stuck there. SGD, due to noise, can jump over low barriers and find a better minimum. For complex neural networks whose loss landscape has millions of local minima, this property is extremely valuable.
What advantage does the noisiness of SGD give compared to batch GD?
Mini-batch Gradient Descent
Batch GD: accurate but slow. SGD: fast but noisy. Is there a golden mean? **Mini-batch Gradient Descent** - instead of the whole dataset or one example, we take a **small group** (mini-batch) of 32, 64, 128 or 256 examples. This is the **standard training method** in PyTorch, TensorFlow, and all modern frameworks.
Why powers of two specifically (32, 64, 128, 256)? This is related to **GPU** architecture. GPUs process data in parallel blocks, and their memory is organized in powers of two. A batch size of 64 uses GPU hardware resources significantly more efficiently than, say, batch size 60. Typical choice: **32** for limited GPU memory, **256** when memory is sufficient.
**Epoch in mini-batch GD:** One epoch = one full pass over the dataset. If the dataset has 10 000 examples and batch_size = 128, one epoch contains ceil(10000 / 128) = 79 iterations (mini-batches). Over 100 epochs the model makes 7 900 weight updates. By comparison, batch GD over 100 epochs makes exactly 100 updates - while the total amount of data processed is the same.
**Batch size affects generalization!** Research shows that smaller batch sizes (32-128) often lead to better model generalization than large ones (1024+). Small batches create more gradient noise, which acts as implicit regularization - helping the model avoid overfitting. Therefore, increasing batch size for speed is not always a good idea.
A dataset contains 64 000 examples, batch_size = 256. How many weight updates happen in one epoch?
Learning Rate: the key hyperparameter
The formula **w = w - lr * gradient** contains one number that determines everything: the **learning rate (lr)**. This is arguably the **most important hyperparameter** in machine learning. Too large an lr - the model jumps past the minimum and diverges. Too small - the model barely crawls and gets stuck. A wrong lr is most often to blame when "the model isn't learning".
In practice the learning rate does not stay constant throughout training. **Learning rate schedules** are used - strategies for changing lr as training progresses. The logic is simple: start with large steps to quickly approach the minimum, then reduce the step size for fine-tuning.
**Warmup: warming up before training** Modern models (BERT, GPT, ResNet) often use warmup: for the first few epochs the learning rate *rises* linearly from zero to the target value, then starts to decrease. Why? At the start of training the weights are random, gradients are large and unstable. A large lr at this stage will "blow up" training. Warmup gives the model time to stabilize. Typical scheme: 5-10% of total steps for warmup, then cosine decay.
**Momentum** is another technique that radically accelerates convergence. The idea: imagine a ball rolling down a hill. It builds up inertia: if the gradient points in the same direction for several steps in a row, the ball speeds up. If the direction changes - inertia smooths out the oscillations. Mathematically: **v = beta * v + gradient; w = w - lr * v**, where beta (usually 0.9) is the inertia coefficient and v is the "velocity".
**Saddle points are a real problem in neural networks.** In a function with thousands of parameters, local minima are not the main threat. Far more common are **saddle points**: the gradient is zero, but this is not a minimum - in some directions the function goes down, in others up (like a horse saddle). SGD with momentum and Adam handle them better than pure gradient descent, thanks to accumulated inertia.
You need to find one perfect learning rate and use it throughout training
In modern ML the learning rate almost always changes according to a schedule: warmup, decay, cosine annealing. A single fixed lr is rare
At the start of training large steps are needed for fast progress, while near the minimum small steps are needed for fine-tuning. The loss landscape is also non-uniform: optimal step size differs across regions. Adam solves this automatically by adapting the lr for each parameter individually.
Learning rate = 0.5, the model does not converge: loss bounces up and down without decreasing. What action will most likely solve the problem?
Key ideas
- **Gradient descent** is an iterative optimization algorithm: compute the gradient of the loss function and take a step *against* it (w = w - lr * gradient)
- **Three variants:** batch GD (whole dataset, accurate, slow), SGD (1 example, fast, noisy), mini-batch (32-256 examples, optimal compromise)
- **Mini-batch GD** is the industry standard: uses GPU parallelism, sufficiently stable and fast - this is what PyTorch and TensorFlow implement
- **Learning rate** is the most important hyperparameter: too large = divergence, too small = stalling. Solutions: lr schedule (warmup + decay), momentum and Adam
- Back to the opening question: the billions of GPT parameters are found exactly this way - billions of tiny mini-batch gradient descent steps, each slightly reducing the error
Related topics
Gradient descent is the foundation of optimization in ML, running through all subsequent topics. Here is how it connects to earlier material and to topics covered next:
- Linear Regression — In linear regression, gradient descent minimizes MSE - the simplest example of the technique, which we used to demonstrate all three variants
- Regularization — L1/L2 regularization adds penalty terms to the loss function, changing the landscape that gradient descent descends
- Optimizers: Adam, RMSProp, AdaGrad — Advanced optimizers are an evolution of SGD with momentum: adaptive learning rates, gradient clipping and other improvements
- Backpropagation — Backpropagation computes gradients in neural networks - without it gradient descent would not know which way to step in multi-layer networks
Вопросы для размышления
- If SGD's noise helps escape local minima, why not add even more noise intentionally? Is there a limit to how useful noise is in optimization?
- Why is SGD + momentum typically used for computer vision (ResNet) while Adam is used for NLP (Transformers)? What is it about the nature of these tasks that calls for different optimization approaches?
- If a learning rate schedule defines in advance how lr will change, do we lose adaptability? What if the optimal moment to change lr depends on the specific dataset?
Связанные уроки
- ml-06-linear-regression — SGD is the alternative to the normal equation for large matrices
- ml-28-optimizers — Adam, Adagrad, RMSProp are GD variants with adaptive learning rate
- ml-08-regularization — Gradient of a regularized loss = gradient + penalty term
- calc-19-gradient
- la-06-gauss