Calculus
Functions of Several Variables
Lesson Objectives
- Understand why functions of many variables are central to ML and optimization
- Interpret surfaces, paraboloids, and saddle points
- Read level curves as a tool for loss landscape analysis
- Determine when a multivariable limit exists and when it does not
Prerequisites
- What a function is (input -> output)
- Limits: the value a function approaches
GPT-4's loss function reportedly takes about 1.8 trillion parameters (a widely cited but unconfirmed estimate) and returns one number. Every Adam step updates all of them simultaneously. That is multivariable calculus in production, at a scale that would have seemed impossible 20 years ago.
- GPT-4: L(theta) with theta in R^{1.8T} (reported estimate), a function of roughly 1.8 trillion variables
- ResNet-50: the backward pass computes about 25M partial derivatives, one per parameter, in one pass
- Adam optimizer: gradient descent with momentum, moving across a loss surface in parameter space
- Li et al. 2018: loss landscape of ResNet visualized as a 3D surface
- Kaggle: hyperparameter tuning = minimizing f(lr, dropout, batch_size, ...)
GPT-4 and 1.8 Trillion Variables
The GPT-4 loss function is $L(\theta)$ where $\theta \in \mathbb{R}^{1.8\text{T}}$, taking the widely reported (unconfirmed) figure of 1.8 trillion parameters. The gradient $\nabla L$ is then a vector of 1.8 trillion numbers, one partial derivative per parameter, and every Adam step uses it to update all parameters at once.
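ResNet-50 processes an input image of $224 \times 224 \times 3 = 150{,}528$ values (50,176 pixels with 3 color channels each), so the network is a function of 150,528 input variables. Its loss is also a function of about 25 million parameters, and the backward pass computes all 25 million partial derivatives in a single pass.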
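To make "many inputs, one output" concrete, here is a minimal sketch using NumPy. The linear model, the names `mse_loss` and `mse_grad`, and the synthetic data are all illustrative assumptions, not taken from any real network. The loss takes a whole parameter vector and returns a single number, and its gradient has one entry per parameter.

```python
import numpy as np

def mse_loss(theta, X, y):
    """Scalar loss of a linear model: one number for a whole parameter vector."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2))

def mse_grad(theta, X, y):
    """Gradient of mse_loss: one partial derivative per parameter."""
    residual = X @ theta - y
    return 2.0 * X.T @ residual / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # 100 samples, 5 features (illustrative)
true_theta = np.arange(1.0, 6.0)    # hypothetical "ground truth" parameters
y = X @ true_theta

theta = np.zeros(5)
loss = mse_loss(theta, X, y)        # a single float
grad = mse_grad(theta, X, y)        # same shape as theta
print(loss, grad.shape)
```

The same pattern holds at any scale: only the length of `theta` changes.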
Which situation does NOT require a function of several variables?
Temperature conversion is a function of ONE variable: $F = 1.8C + 32$. All others require two or more inputs.
Surfaces: The Loss Landscape of Neural Networks
The graph of $f(x,y)$ is a surface in 3D. The loss surface of a neural network with two parameters looks like a mountain landscape with valleys (local minima), passes (saddle points), and peaks. Li et al. (2018) rendered exactly such a landscape for ResNet-56.
Paraboloid f(x,y) = x^2 + y^2:
- f(0,0) = 0 (minimum)
- f(1,0) = 1, f(0,2) = 4, f(1,1) = 2
- Shape: a bowl, like a satellite dish or an idealized loss surface

Saddle f(x,y) = x^2 - y^2:
- f(0,0) = 0 (not a minimum, not a maximum)
- Along the x-axis: x^2 increases away from 0 -> (0,0) is a minimum in that direction
- Along the y-axis: -y^2 decreases away from 0 -> (0,0) is a maximum in that direction
- (0,0) is a saddle point
Saddle points are a major challenge in neural network optimization: the gradient is zero there, just as at a minimum, but the point is not a minimum. Momentum, as used in Adam, helps the optimizer keep moving through the flat region around a saddle instead of stalling.
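The stalling behaviour can be sketched on the saddle $f(x,y) = x^2 - y^2$ itself. This is a toy illustration with plain gradient descent (no momentum); the helper `descend`, the learning rate, and the step count are assumptions chosen for the demo. Because this function is unbounded below, "escaping" here simply means the value keeps dropping along the y direction.

```python
import numpy as np

def saddle(p):
    x, y = p
    return x**2 - y**2

def saddle_grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def descend(start, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - lr * saddle_grad(p)
    return p

stuck = descend([1.0, 0.0])      # starts exactly on the x-axis
escaped = descend([1.0, 1e-6])   # same start with a tiny nudge in y

print(stuck, np.linalg.norm(saddle_grad(stuck)))  # gradient is nearly zero
print(escaped, saddle(escaped))                   # value has fallen below 0
```

Started exactly on the x-axis, the iterate slides into the saddle point and the gradient vanishes, so the update stops carrying information. A tiny perturbation in y is enough to leave, which is the intuition behind noise and momentum helping in practice.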
Level Curves: Reading the Loss Landscape
TensorBoard shows loss curves, which are functions of one variable (the training step). The loss landscape itself is a high-dimensional surface. Li et al. (2018) slice it with a 2D plane spanned by two random directions in parameter space and draw level curves of the loss on that plane. Dense contours mean steep slopes, which tend to make training unstable.
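Level curve at height c: all points where x^2 + y^2 = c.
This is a circle of radius sqrt(c).
- c = 1: r = 1.0 (unit circle)
- c = 4: r = 2.0
- c = 9: r = 3.0
Level curves are concentric circles. For equally spaced values of c, the circles bunch together farther from the center, where the gradient (2x, 2y) is larger and the slope is steeper.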
The gradient is always perpendicular to level curves and points in the direction of steepest ascent. Gradient descent steps along the negative gradient, which is why each step follows the locally steepest path downhill.
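The perpendicularity claim can be checked numerically for the paraboloid $f(x,y) = x^2 + y^2$. The sketch below assumes NumPy and picks the level $c = 4$ and eight sample points arbitrarily. At each point on the circle it takes the dot product of the gradient with the tangent direction of the circle; a value of zero means the two are perpendicular.

```python
import numpy as np

def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + y^2."""
    return np.array([2.0 * x, 2.0 * y])

c = 4.0
r = np.sqrt(c)  # the level curve f = c is a circle of radius sqrt(c)
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)

dots = []
for t in angles:
    point = np.array([r * np.cos(t), r * np.sin(t)])
    tangent = np.array([-np.sin(t), np.cos(t)])  # direction along the circle
    dots.append(float(grad_f(*point) @ tangent))

max_abs_dot = max(abs(d) for d in dots)
print(max_abs_dot)  # zero up to floating-point rounding
```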
Dense level curves appear on a loss landscape visualization. What does this indicate?
Dense level curves mean a large change in loss over a small distance, which means a large gradient. With a fixed learning rate, that produces large, unstable gradient descent steps, which is exactly why gradient clipping is used on steep slopes.
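One common form of gradient clipping rescales the gradient when its norm exceeds a threshold. The sketch below is a minimal NumPy version under that assumption; the function name `clip_by_norm` and the threshold are illustrative, and deep learning frameworks ship their own utilities for this.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Shrink grad to length max_norm if it is longer; keep its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

steep = np.array([30.0, 40.0])    # large gradient on a steep slope (norm 50)
gentle = np.array([0.3, 0.4])     # small gradient on a gentle slope (norm 0.5)

clipped_steep = clip_by_norm(steep, max_norm=1.0)
clipped_gentle = clip_by_norm(gentle, max_norm=1.0)
print(clipped_steep, clipped_gentle)
```

The direction of the step is preserved and only its length is capped, so the optimizer still heads downhill but cannot overshoot as far on a steep wall.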
Limits: Path Matters
In one dimension a point can be approached from only two directions: left or right. In two dimensions there are **infinitely many paths**: along any line, parabola, or spiral. A limit exists only if every path gives the same value.
Consider f(x, y) = xy/(x^2 + y^2) as (x, y) -> (0, 0).
- Along the x-axis (y = 0): x*0/(x^2 + 0) = 0
- Along the y-axis (x = 0): 0*y/(0 + y^2) = 0
It seems like the limit is 0. But:
- Along y = x: x*x/(x^2 + x^2) = x^2/(2x^2) = 1/2
The axes give 0, the diagonal gives 1/2. The limit does not exist.
Even if every straight-line approach gives the same answer, the limit may still fail to exist - curved paths must also be checked. In ML this explains why a loss surface may look smooth along one direction yet behave unexpectedly in another.
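Both situations can be explored numerically. In the sketch below, `f` is the function $xy/(x^2+y^2)$, whose limit differs between the axes and the diagonal. The second function `g`, $x^2 y/(x^4+y^2)$, is a standard textbook example brought in here as an assumption for illustration: it tends to 0 along every straight line through the origin but equals 1/2 along the parabola $y = x^2$.

```python
def f(x, y):
    return x * y / (x**2 + y**2)

def g(x, y):
    return x**2 * y / (x**4 + y**2)

# Points approaching the origin: t = 0.1, 0.01, ..., 1e-6
ts = [10.0 ** -k for k in range(1, 7)]

f_axis = [f(t, 0.0) for t in ts]        # along the x-axis
f_diag = [f(t, t) for t in ts]          # along the line y = x

g_line = [g(t, 3.0 * t) for t in ts]    # along the straight line y = 3x
g_parab = [g(t, t**2) for t in ts]      # along the parabola y = x^2

print(f_axis[-1], f_diag[-1])
print(g_line[-1], g_parab[-1])
```

Sampling a few paths can only disprove a limit, never prove one: agreement along the paths tried says nothing about the paths not tried.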
Why are multivariable limits harder than one-dimensional limits?
On the line - two paths. On the plane - infinitely many. Every path must give the same answer for the limit to exist.
Practice
Three problems: from concrete computation to limit analysis.
L(w, b) = (w*x + b - y)^2 at x = 2, y = 3:
- L(1.0, 0.0) = (1*2 + 0 - 3)^2 = (-1)^2 = 1.0
- L(1.5, 0.0) = (1.5*2 + 0 - 3)^2 = 0^2 = 0.0 (optimal)
- L(1.0, 1.0) = (1*2 + 1 - 3)^2 = 0^2 = 0.0 (also optimal)
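The computation above can be reproduced in a few lines of Python. The function name `loss` and the default arguments are illustrative choices. The last lines also check a consequence of the two optimal points: with a single data point, every pair (w, b) on the line 2w + b = 3 gives zero loss, so the minimum is a whole line rather than one point.

```python
def loss(w, b, x=2.0, y=3.0):
    """Squared error of the prediction w*x + b against the target y."""
    return (w * x + b - y) ** 2

print(loss(1.0, 0.0))  # 1.0
print(loss(1.5, 0.0))  # 0.0
print(loss(1.0, 1.0))  # 0.0

# Every (w, b) on the line 2w + b = 3 fits the single data point exactly.
on_line = [loss(w, 3.0 - 2.0 * w) for w in (-1.0, 0.0, 0.5, 2.0)]
print(on_line)
```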
Level curve at height c of f(x, y) = x + 2y: x + 2y = c, or y = (c - x)/2.
These are lines with slope -1/2.
- c = 0: line y = -x/2
- c = 2: line y = 1 - x/2
- c = -2: line y = -1 - x/2
The level curves are parallel lines. The surface z = x + 2y is a tilted plane. Gradient: (1, 2), perpendicular to every level curve.
Consider f(x, y) = (x^2 - y^2)/(x^2 + y^2) as (x, y) -> (0, 0).
- Along the x-axis (y = 0): x^2/x^2 = 1
- Along the y-axis (x = 0): -y^2/y^2 = -1
1 != -1, so the limit does not exist. Geometric sense: this function is constant on rays from the origin but takes different values on different rays.
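The "constant on rays" observation can be checked directly. Substituting $x = r\cos\theta$, $y = r\sin\theta$ into $(x^2 - y^2)/(x^2 + y^2)$ gives $\cos 2\theta$, with no dependence on $r$. The sketch below verifies this numerically; the helper `along_ray` and the sample radii and angles are illustrative assumptions.

```python
import math

def f(x, y):
    return (x**2 - y**2) / (x**2 + y**2)

def along_ray(theta, radii=(1.0, 1e-3, 1e-6)):
    """Values of f at shrinking distances r along the ray at angle theta."""
    return [f(r * math.cos(theta), r * math.sin(theta)) for r in radii]

for theta in (0.0, math.pi / 6, math.pi / 4, math.pi / 2):
    values = along_ray(theta)
    print(round(theta, 3), values, math.cos(2.0 * theta))
```

Each ray has its own limiting value, so no single number can serve as the limit at the origin.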
Where This Topic Leads
Functions of several variables are the foundation of everything ahead in ML:
- Partial Derivatives — How fast does the loss change along each parameter?
- Gradient and Gradient Descent — Direction of steepest loss decrease - the foundation of neural network training
- Optimization — Finding the minimum of loss in a space of millions of parameters
- Taylor Series — Multivariable Taylor: quadratic loss approximation is the basis of L-BFGS
Summary
- A function $f: \mathbb{R}^n \to \mathbb{R}$ takes multiple inputs, returns one output; neural network loss is exactly this
- The graph of $f(x,y)$ is a surface: paraboloid (bowl), saddle - the key shapes in optimization
- Level curves are 'slices' of the surface; dense curves mean large gradient and unstable training
- The gradient $\nabla f$ is perpendicular to level curves and points in the direction of steepest ascent
- A limit exists only when every path to the point gives the same value
Reflection Questions
- Why are saddle points harder for gradient descent than local minima?
- How did Li et al. visualize ResNet's loss landscape in 2D?
- Why is gradient clipping applied at steep loss slopes?
- If the limit along every straight line is the same, does the limit necessarily exist?