Calculus
Gradient and Directional Derivative
Gradient descent trains GPT-4 over thousands of GPU-hours. Every step moves against $\nabla L$. Without understanding the gradient you cannot explain why Adam outperforms SGD, why batch normalization stabilizes training, or why vanishing gradients kill deep networks.
- Neural network weight optimization: $w \leftarrow w - \alpha \nabla_w L$
- Drone navigation: air pressure gradient points toward altitude gain
- Computer graphics: surface normal $\nabla f / |\nabla f|$ for shading and lighting
- MRI: gradient fields encode spatial position in magnetic resonance imaging
- Economics: steepest-ascent direction in the profit parameter space
Цели урока
- Compute the gradient of a function and interpret it as the direction of steepest ascent
- Find the directional derivative using the dot product with a unit vector
- Analyze gradient descent convergence via the Lipschitz condition and $O(1/k)$ rate
Предварительные знания
- Partial derivatives of multivariable functions
- Dot product and angle between vectors
- Vector norms
Gradient: the vector of partial derivatives
The gradient $\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$ points in the direction of steepest ascent of $f$. Its magnitude $|\nabla f|$ is the rate of increase in that direction. Perpendicular to $\nabla f$, the function does not change - that is the direction along a level curve.
Directional derivative
The directional derivative of $f$ along unit vector $\mathbf{v}$: $D_{\mathbf{v}} f = \nabla f \cdot \mathbf{v} = |\nabla f|\cos\theta$. Maximum when $\mathbf{v} = \nabla f / |\nabla f|$, zero when $\theta = \pi/2$ - along the level curve.
Gradient descent convergence
For an $L$-smooth function (Lipschitz gradient constant $L$), step size $\alpha = 1/L$ guarantees loss decrease at every iteration. Convergence rate $O(1/k)$ for convex functions: $f(x_k) - f^* \leq \frac{L \|x_0 - x^*\|^2}{2k}$.
Stochastic SGD estimates $\nabla f$ from a mini-batch. For convergence you need decreasing step sizes satisfying $\sum \alpha_k = \infty$ and $\sum \alpha_k^2 < \infty$ (Robbins-Monro conditions).
The Gradient: Steepest Ascent in Parameter Space
The Adam optimizer running GPT-4 (175 billion parameters) moves the entire parameter vector theta in the direction of the negative gradient at every training step. The gradient of the loss surface encodes not just the magnitude of change for each parameter, but the direction of the sharpest increase in loss. Moving opposite to that direction decreases loss fastest.
The gradient is a vector field: at every point in the domain, it assigns a vector that points in the direction of steepest ascent of f. Its magnitude is the rate of increase in that direction. Level curves of f are always perpendicular to the gradient at each point.
In a weight matrix W of shape (768, 3072) inside a transformer, the gradient of the loss with respect to W has exactly the same shape. Each element grad_W[i,j] = dL/dW[i,j] is a partial derivative computed by backpropagation. The optimizer uses the entire gradient tensor to update W.
For f(x, y, z) = x^2*y + yz^3, compute the gradient at the point (1, 2, -1). What is its magnitude?
The gradient points in the direction of steepest increase of the function and is the vector of its partial derivatives.
Directional Derivative and the Meaning of Gradient
In reinforcement learning, a policy gradient update moves parameters in the direction that maximally increases expected reward. The directional derivative quantifies how fast a function changes along any specific direction, not just along coordinate axes. The gradient is the unique vector that encodes all directional derivatives simultaneously.
The directional derivative formula D_u f = grad_f dot u explains why the negative gradient is the optimal descent direction: cos(theta) = -1 when u points exactly opposite to the gradient, giving the most negative (fastest decreasing) directional derivative. Any other direction decreases f more slowly.
For f(x,y) = e^x * cos(y), find the directional derivative at (0, 0) in the direction of the vector (1, 1).
The answer follows directly from the definition and properties of the object under consideration.
Gradient Descent: Algorithm, Convergence, and Learning Rate
Training ResNet-50 on ImageNet runs approximately 300,000 gradient descent steps. Each step subtracts a scaled gradient from the current parameter vector. The learning rate controls the step size and is the most critical hyperparameter: too large and the iterate diverges, too small and training takes prohibitively long. For strongly convex functions, the convergence rate is geometric.
Adam (Adaptive Moment Estimation) modifies plain gradient descent by maintaining running estimates of the first moment (mean gradient) and second moment (uncentered variance). The effective learning rate for each parameter is scaled by the inverse square root of its second moment, giving larger steps for infrequent gradients and smaller steps for frequent large-gradient parameters. This is why Adam trains transformers faster than SGD with a constant learning rate.
For L(x,y) = 3x^2 + y^2, starting at (2, 4) with learning rate alpha = 0.1, compute the first two gradient descent steps. Where does the trajectory converge?
The gradient points in the direction of steepest increase of the function and is the vector of its partial derivatives.
Geometric meaning on the surface $f(x,y) = x^2 + 4y^2$
$\nabla f = (2x, 8y)$. At $(1, 0.5)$: $\nabla f = (2, 4)$ - the direction of steepest ascent. Level curves are ellipses. The gradient is always perpendicular to the tangent of the ellipse at that point.
Итоги
- The gradient $\nabla f$ points in the direction of steepest ascent; its magnitude equals the rate of increase
- Directional derivative $D_{\mathbf{v}} f = \nabla f \cdot \mathbf{v}$: maximum along the gradient, zero along level curves
- Gradient descent converges in $O(1/k)$ steps with step size $\alpha \leq 1/L$
Connections to other topics
The gradient is the building block for second-order methods: the Hessian $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ encodes curvature and is used in Newton's method. In vector calculus, the gradient generalizes to divergence and curl, which appear in Stokes' theorem.
- Related topics — extends
Вопросы для размышления
- Why do Adam and RMSProp outperform SGD? What additional information about the gradient do they exploit?
- Vanishing gradients: what happens to $|\nabla f|$ in deep networks, and why does this block early-layer learning?
- If the loss is non-convex (as in neural networks), gradient descent finds a local minimum. Why does this turn out to be good enough in practice?