Deep Learning
Backpropagation: How Neural Networks Learn
Geoffrey Hinton and the revival of neural networks
In 2006, Geoffrey Hinton published work on deep belief networks that reignited interest in neural networks after a long AI winter. He showed that deep networks could be trained by pre-training layers one at a time, working around the vanishing gradients problem. Six years later his student Alex Krizhevsky applied backprop with ReLU and dropout on a GPU - and AlexNet changed everything.
Цели урока
- Understand the chain rule as the mathematical foundation of backpropagation
- Distinguish forward and backward pass and know their computational complexity
- Read and understand the computational graph for simple functions
- Use PyTorch autograd to compute gradients
Backpropagation was rediscovered 4 times: 1960 (Kelley), 1970 (Linnainmaa), 1974 (Werbos), 1986 (Rumelhart). One algorithm - the entire deep network. Every GPT token is one forward plus backward pass. GPT-4 with 1.8 trillion parameters uses the same chain rule that solves y = sin(x2) in 2 lines of code.
- **Training GPT-4** - without backpropagation, computing 1.8T gradients would be economically impossible: billions of times more expensive
- **AlphaFold2 (DeepMind)** - backprop trains the network to predict 3D protein structure, a revolution in molecular biology
- **Tesla Autopilot** - network trains on millions of camera frames, backprop adjusts millions of parameters after each batch
- **Stable Diffusion** - each denoising step in reverse diffusion is a forward pass, DDPM trains via backprop across all timesteps
- **GitHub Copilot** - Codex (GPT-3 fine-tuned on code) trained via backprop on 159 GB of source code
Предварительные знания
Chain Rule: derivative of a composite function
Backpropagation was rediscovered four times: Kelley (1960), Linnainmaa (1970), Werbos (1974), Rumelhart (1986). The same algorithm every time. The math behind it is the **chain rule**: if f depends on g, and g depends on x, then the derivative of f with respect to x equals the product of local derivatives. Formally: if y = f(g(x)), then dy/dx = df/dg * dg/dx.
A neural network is simply a very long chain of functions. ResNet-152 has 152 such links. GPT-4 has thousands. The chain rule makes it possible to compute how each parameter change affects the final loss, without manually unrolling the entire chain.
**For multiple variables** the chain rule works the same way. If z = f(x, y), x = x(t), y = y(t), then dz/dt = (dz/dx) * (dx/dt) + (dz/dy) * (dy/dt). In neural networks this means: the gradient with respect to a weight depends on which neurons the signal passes through.
**Numerical gradient checking** is an important debugging tool. If the analytical and numerical gradients differ by more than 1e-5, there is a bug in the code. Karpathy recommends always checking gradients when writing custom layers. `torch.autograd.gradcheck` automates this.
Given: y = sin(x²). What is dy/dx by the chain rule?
Forward and Backward Pass
Every GPT-4 token involves one forward pass and one backward pass. Forward pass: data flows from input to output, the network makes a prediction. Backward pass: the error flows from output to input, gradients are computed for each of the 1.8 trillion parameters. This is backpropagation - the chain rule applied systematically.
**Vanishing gradients:** when using sigmoid the derivative <= 0.25. Through 10 layers: 0.25^10 approx 0.000001 - gradient vanishes. **Exploding gradients:** if weights > 1, gradients grow exponentially and become NaN. Solutions: ReLU (vanishing), gradient clipping (exploding), residual connections (both). ResNet solved vanishing gradients in 2015 with skip connections.
| Problem | Symptom | Solution |
|---|---|---|
| Vanishing gradients | Early layers do not train, loss stays flat | ReLU, residual connections, batch norm |
| Exploding gradients | Loss = NaN, weights → +-infinity | Gradient clipping, proper initialization |
| Dead neurons | Neuron always outputs 0 (ReLU) | Leaky ReLU, proper learning rate |
**Key observation:** the backward pass has the same computational complexity as the forward pass - O(n), where n is the number of operations. This means gradients for ALL parameters (millions or billions) are computed in a single pass. That is precisely what made deep learning economically viable.
What happens to gradients in a deep network when sigmoid is used in all layers?
Computational Graph: a map of computations
Millions of operations in a network - how do gradients get organized? A **computational graph** is a directed acyclic graph (DAG) where each node is an operation and edges are data flows. During the forward pass each node computes its result and stores local gradients. During the backward pass it passes the gradient backward. This is not an abstraction - it is exactly how PyTorch stores the graph internally.
The elegance of the computational graph: each node only knows its own **local** operation. The "+" node knows the derivative of a sum with respect to each input = 1. The "*" node knows the derivative of a product with respect to one input = the other input. All the remaining work is done by the chain rule - local gradients are simply multiplied along the path. No global knowledge of the network is needed.
This Value class is a simplified version of how PyTorch works internally. Andrej Karpathy built the library **micrograd** on this foundation (100 lines of code), which implements a fully functional autograd engine. Understanding this code is the key to understanding PyTorch.
**The `grad +=` pattern** (not `grad =`) matters. If a variable is used in multiple operations, gradients from all paths are **summed**. This follows from the multivariate chain rule: dL/dx = sum of (dL/dyi * dyi/dx) over all yi depending on x. In BERT, one token embedding is used across hundreds of attention operations - gradients from all paths accumulate.
In the computational graph for f = (x + y) · z with x=-2, y=5, z=-4, what is ∂f/∂y?
Autograd: automatic differentiation
Computing gradients by hand for 1.8 trillion parameters is not possible. PyTorch does it automatically through **autograd**: every operation on tensors is recorded in a computational graph, and when `.backward()` is called, gradients for all parameters are computed in a single pass. The same algorithm as above - just in 100 lines of C++ instead of Python.
In practice, autograd is used in the training loop. The forward pass computes the loss, the backward pass computes gradients, the optimizer updates the weights. This pattern is identical for XOR on 2 neurons and for LLaMA-70B.
**optimizer.zero_grad() is critical!** By default PyTorch **accumulates** gradients (+=, not =). Without zeroing, gradients from the previous batch add to the current ones and training breaks. A common bug in production code.
**torch.no_grad()** disables computational graph construction. Use it during inference and validation - it saves memory (no intermediate values stored for the backward pass) and speeds up computation by 20-40%.
To debug gradients: `tensor.grad_fn` shows which operation created the tensor. `tensor.requires_grad` checks if it is being tracked. If `.grad` is `None` after `.backward()` - the tensor was not part of the computational graph (forgot `requires_grad=True`).
Backpropagation is the algorithm for training neural networks. Backprop = gradient descent.
Backpropagation is an algorithm for **computing gradients** via the chain rule. Gradient descent (or Adam, SGD, RMSProp) is the algorithm that **uses** those gradients to update weights.
Separation of concerns: backprop computes dL/dw for every weight, and the optimizer decides how to update the weights. The same backprop with different optimizers (SGD, Adam, RMSProp) produces different training results. Adam applies adaptive learning rates per parameter - but the gradients still come from backprop.
Why call optimizer.zero_grad() before loss.backward()?
Key ideas
- **Chain rule** - the mathematical foundation: derivative of a composite function = product of local derivatives. Backpropagation is the chain rule applied to a computational graph
- **Forward pass** computes the prediction, **backward pass** computes gradients for all weights. Both have O(n) complexity
- **Computational graph** organizes computations: each node knows only its local gradient. PyTorch builds this graph automatically
- **Autograd** automates everything: .backward() computes gradients, optimizer.step() updates weights. optimizer.zero_grad() is mandatory
- Vanishing gradients (sigmoid) crippled early deep networks. ReLU plus residual connections (ResNet, 2015) solved the problem
- Backprop was rediscovered 4 times. Geoffrey Hinton in 2006 brought deep networks back from the AI winter via deep belief networks
- Backprop is not gradient descent. Backprop computes gradients. Adam/SGD decide what to do with them
Related topics
Backpropagation is the central algorithm of deep learning:
- Neural Networks — Backprop trains the networks built in the previous lesson
- PyTorch vs TensorFlow — Frameworks implement autograd differently
- Gradient Descent — Backprop computes gradients, GD uses them to update weights
Вопросы для размышления
- Backpropagation was discovered in the 1970s, but neural networks only took off in 2012. What changed - algorithms, data, or hardware?
- Why does PyTorch accumulate gradients by default (+=) rather than overwrite (=)? In what situations is accumulation useful?
- If autograd computes gradients automatically, why should a developer understand how backpropagation works?
Связанные уроки
- dl-01 — Neural network architecture that backprop trains
- dl-03 — PyTorch and TensorFlow implement autograd differently
- ml-09-gradient-descent — Backprop computes gradients, gradient descent uses them
- calc-08-chain-rule — Chain rule is the mathematical foundation of all backprop
- calc-18-partial — Partial derivatives for the multivariate chain rule
- ml-28-optimizers — Adam, RMSProp, SGD - algorithms that consume backprop output