Calculus

Functions of Several Variables

Lesson Objectives

Understand why functions of many variables are central to ML and optimization
Interpret surfaces, paraboloids, and saddle points
Read level curves as a tool for loss landscape analysis
Determine when a multivariable limit exists and when it does not

Prerequisites

What a function is (input -> output)
Limits - where a function tends

GPT-4's loss function takes a reported 1.8 trillion parameters (an unconfirmed estimate; the real count has not been published) and returns one number. Every optimizer step, with Adam for example, updates all of them simultaneously. That is multivariable calculus in production at a scale that would have seemed impossible 20 years ago.

GPT-4: L(theta) with theta in R^{1.8T} (reported estimate) - function of about 1.8 trillion variables
ResNet-50: backward pass computes 25M partial derivatives in one pass
Adam optimizer: momentum in parameter space = gradient descent on a surface
Li et al. 2018: loss landscape of ResNet visualized as a 3D surface
Kaggle: hyperparameter tuning = minimizing f(lr, dropout, batch_size, ...)

GPT-4 and 1.8 Trillion Variables

The GPT-4 loss function is $L(\theta)$ where $\theta \in \mathbb{R}^{1.8\text{T}}$, taking the widely reported but unconfirmed parameter count at face value. The gradient $\nabla L$ is then a vector of 1.8 trillion numbers, one partial derivative per parameter. Every optimizer step updates all of these parameters simultaneously. That is a function of several variables at a scale unimaginable 20 years ago.

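ResNet-50 processes an input image of $224 \times 224 \times 3 = 150{,}528$ values (pixels times color channels), so as a function of its input it has 150,528 variables. As a function of its weights it has about 25 million variables, and the backward pass computes all 25 million partial derivatives in a single pass.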

Which situation does NOT require a function of several variables?

Temperature conversion is a function of ONE variable: $F = 1.8C + 32$. All others require two or more inputs.

Surfaces: The Loss Landscape of Neural Networks

The graph of $f(x,y)$ is a surface in 3D. The loss surface of a neural network with two parameters looks like a mountain landscape with valleys (local minima), passes (saddle points), and peaks. Li et al. (2018) literally rendered such a landscape for ResNet-56.

Paraboloid f(x,y) = x^2 + y^2:
- f(0,0) = 0 (minimum)
- f(1,0) = 1, f(0,2) = 4, f(1,1) = 2
- Shape: satellite dish / Adam loss bowl

Saddle f(x,y) = x^2 - y^2:
- f(0,0) = 0 (not min, not max)
- Along x: x^2 increases -> minimum
- Along y: -y^2 decreases -> maximum
- (0,0) is a saddle point

Saddle points are a major challenge in neural network optimization: the gradient is zero there, just as at a minimum, but the point is not a minimum. Optimizers with momentum, such as Adam, carry accumulated velocity that helps them move past saddle points.
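A minimal sketch of why a saddle is troublesome, using the saddle f(x,y) = x^2 - y^2 from above. It runs plain gradient descent without momentum, which is a simplification and not Adam itself; the learning rate, step count, and starting points are arbitrary illustrative choices.

```python
def grad_saddle(x, y):
    """Gradient of f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y


def descend(x, y, lr=0.1, steps=100):
    """Plain gradient descent (no momentum) starting from (x, y)."""
    for _ in range(steps):
        gx, gy = grad_saddle(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Start exactly on the x-axis: the y-component of the gradient is zero,
# so the iterate slides into the saddle point (0, 0) and stays there.
stuck = descend(1.0, 0.0)

# Start a tiny distance off the axis: y grows by a factor of 1.2 per step,
# so the iterate eventually leaves the saddle along the y direction.
escaped = descend(1.0, 1e-6)

print("on the axis: ", stuck)
print("off the axis:", escaped)
```

The first run converges to a point with zero gradient that is not a minimum. The second shows that any small push along the descending direction is enough to leave, which is the role momentum or gradient noise plays in practice.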

Level Curves: Reading the Loss Landscape

TensorBoard shows loss curves: loss as a function of one variable (training step). But the loss landscape is a high-dimensional surface. Li et al. (2018) restrict it to a 2D slice spanned by two random directions and draw the level curves of that slice. Dense contours mean steep slopes and unstable training.

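Level curve c of f(x,y) = x^2 + y^2: all points where x^2 + y^2 = c. This is a circle of radius sqrt(c).
- c = 1: r = 1.0 (unit circle)
- c = 4: r = 2.0
- c = 9: r = 3.0

Level curves = concentric circles. For equally spaced values of c, the circles crowd together farther from the center, where the slope is steeper (the gradient has length 2r).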
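A quick numerical check of the radii and of the slope on each circle, as a sketch for f(x,y) = x^2 + y^2 only. The helper names are made up for this illustration.

```python
import math


def radius(c):
    """Radius of the level curve x**2 + y**2 = c (requires c >= 0)."""
    return math.sqrt(c)


def grad_norm_on_level(c):
    """Length of the gradient (2x, 2y) anywhere on the level curve: 2 * sqrt(c)."""
    return 2.0 * math.sqrt(c)


for c in (1, 4, 9):
    print(f"c={c}: radius={radius(c)}, gradient length={grad_norm_on_level(c)}")

# Distance between the contours at levels c and c + 1.
# Equal steps in c give smaller and smaller gaps as the circles get larger.
gaps = [radius(c + 1) - radius(c) for c in (1, 4, 9)]
print("gaps between neighbouring contours:", gaps)
```

The gradient length doubles from the unit circle to the circle of radius 2, and the gaps between contours at equally spaced levels shrink as the radius grows.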

The gradient is always perpendicular to level curves and points uphill. Gradient descent steps in the opposite direction, which is locally the steepest way downhill.
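The perpendicularity claim can be checked directly on the paraboloid f(x,y) = x^2 + y^2, whose level curves are circles. The sketch below assumes that function; the sample angles are arbitrary.

```python
import math


def grad_f(x, y):
    """Gradient of f(x, y) = x**2 + y**2."""
    return 2.0 * x, 2.0 * y


def tangent(x, y):
    """Tangent direction of the circle through (x, y): the position vector rotated by 90 degrees."""
    return -y, x


def dot_grad_tangent(x, y):
    """Dot product of the gradient with the level-curve tangent at (x, y)."""
    gx, gy = grad_f(x, y)
    tx, ty = tangent(x, y)
    return gx * tx + gy * ty


# Sample a few points on the level curve c = 4 (circle of radius 2).
for angle in (0.3, 1.2, 2.5):
    x, y = 2.0 * math.cos(angle), 2.0 * math.sin(angle)
    print(f"angle={angle}: dot product = {dot_grad_tangent(x, y):.1e}")
```

A zero dot product at every sampled point means the gradient has no component along the contour, so moving along a level curve leaves f unchanged to first order.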

Dense level curves appear on a loss landscape visualization. What does this indicate?

Dense level curves mean a large change in loss over a small distance, which means a large gradient. A large gradient produces large update steps and can make gradient descent unstable, which is why gradient clipping is used on steep slopes.
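One common form of gradient clipping rescales the gradient when its length exceeds a threshold. The sketch below is a plain-Python illustration of that idea, not any particular library's implementation; the function name and the threshold of 5.0 are assumptions made for the example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm; the direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


steep = clip_by_norm([30.0, 40.0], max_norm=5.0)  # norm 50, rescaled down to norm 5
gentle = clip_by_norm([0.3, 0.4], max_norm=5.0)   # norm 0.5, returned unchanged

print("steep slope: ", steep)
print("gentle slope:", gentle)
```

Clipping caps the step length on steep parts of the surface while leaving the descent direction and all small gradients untouched.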

Limits: Path Matters

In one dimension a point can be approached from only two directions: left or right. In two dimensions there are **infinitely many paths**: along any line, parabola, or spiral. A limit exists only if every path gives the same value.

Example: f(x,y) = xy/(x^2 + y^2) as (x,y) -> (0,0).
- Along x-axis (y=0): x*0/(x^2+0) = 0
- Along y-axis (x=0): 0*y/(0+y^2) = 0

Seems like the limit is 0. But:
- Along y=x: x*x/(x^2+x^2) = x^2/(2x^2) = 1/2

Axes give 0, diagonal gives 1/2. The limit does not exist.
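The path computations above correspond to f(x,y) = xy/(x^2 + y^2) near the origin. A short numerical sketch makes the disagreement visible; the sample distances t are arbitrary.

```python
def f(x, y):
    """f(x, y) = x*y / (x**2 + y**2); undefined at the origin itself."""
    return x * y / (x * x + y * y)


for t in (1e-1, 1e-3, 1e-6):
    along_x_axis = f(t, 0.0)   # path y = 0
    along_y_axis = f(0.0, t)   # path x = 0
    along_diagonal = f(t, t)   # path y = x
    print(f"t={t}: x-axis={along_x_axis}, y-axis={along_y_axis}, diagonal={along_diagonal}")
```

However close t gets to zero, the axis paths stay at 0 and the diagonal stays at 1/2, so no single value can serve as the limit.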

Even if every straight-line approach gives the same answer, the limit may still fail to exist - curved paths must also be checked. In ML this explains why a loss surface may look smooth along one direction yet behave unexpectedly in another.
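A standard textbook illustration of this, not taken from the lesson itself, is g(x,y) = x^2 y/(x^4 + y^2). Along any line y = mx the value is mx/(x^2 + m^2), which tends to 0, while along the parabola y = x^2 the value is x^4/(2x^4) = 1/2. The sketch below checks both numerically; the slopes and distances are arbitrary choices.

```python
def g(x, y):
    """g(x, y) = x**2 * y / (x**4 + y**2); undefined at the origin itself."""
    return x * x * y / (x ** 4 + y * y)


for t in (1e-1, 1e-2, 1e-3):
    # Straight lines y = m * x through the origin, for a few slopes m.
    on_lines = [g(t, m * t) for m in (0.5, 1.0, 3.0)]
    # Curved path y = x**2.
    on_parabola = g(t, t * t)
    print(f"t={t}: lines={on_lines}, parabola={on_parabola}")
```

The values on the lines shrink toward 0 as t decreases, while the value on the parabola stays near 1/2, so the limit at the origin does not exist even though every straight line agrees.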

Why are multivariable limits harder than one-dimensional limits?

On the line - two paths. On the plane - infinitely many. Every path must give the same answer for the limit to exist.

Practice

Three problems: from concrete computation to limit analysis.

L(w, b) = (w*x + b - y)^2 at x=2, y=3
- L(1.0, 0.0) = (1*2 + 0 - 3)^2 = (-1)^2 = 1.0
- L(1.5, 0.0) = (1.5*2 + 0 - 3)^2 = 0^2 = 0.0 (optimal!)
- L(1.0, 1.0) = (1*2 + 1 - 3)^2 = 0^2 = 0.0 (also optimal)
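The same computation as a small function, as a sketch: the name `squared_error` and the default arguments are choices made for this example. It also makes the reason for the two optimal points visible, since every (w, b) with 2w + b = 3 gives zero loss.

```python
def squared_error(w, b, x=2.0, y=3.0):
    """Loss of the one-sample linear model: L(w, b) = (w*x + b - y)**2."""
    return (w * x + b - y) ** 2


for w, b in [(1.0, 0.0), (1.5, 0.0), (1.0, 1.0)]:
    print(f"L({w}, {b}) = {squared_error(w, b)}")

# Every point on the line 2*w + b = 3 is a minimizer, for example (0, 3).
print("L(0.0, 3.0) =", squared_error(0.0, 3.0))
```

With one data point and two parameters, the zero-loss set is a whole line in the (w, b) plane rather than a single point, which is itself a level curve of L.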

f(x,y) = x + 2y. Level curve c: x + 2y = c, or y = (c-x)/2. These are lines with slope -1/2.
- c = 0: line y = -x/2
- c = 2: line y = 1 - x/2
- c = -2: line y = -1 - x/2

Parallel lines. Surface z = x+2y is a tilted plane. Gradient: (1, 2) - perpendicular to every level curve.
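A short check of both statements for the plane z = x + 2y: the function stays constant when walking along a level line, and the gradient (1, 2) has zero dot product with the line's direction. The direction vector (2, -1) is one choice with slope -1/2; the helper name is made up for this sketch.

```python
def f(x, y):
    """The tilted plane f(x, y) = x + 2y."""
    return x + 2.0 * y


GRADIENT = (1.0, 2.0)     # partial derivatives of f, the same at every point
DIRECTION = (2.0, -1.0)   # a step along a level line (slope -1/2)


def walk_along_level_line(c, steps):
    """Start at (0, c/2) on the line x + 2y = c and record f after each step along DIRECTION."""
    x, y = 0.0, c / 2.0
    values = []
    for _ in range(steps):
        values.append(f(x, y))
        x, y = x + DIRECTION[0], y + DIRECTION[1]
    return values


dot = GRADIENT[0] * DIRECTION[0] + GRADIENT[1] * DIRECTION[1]
print("gradient . direction =", dot)
print("f along the line c = 2:", walk_along_level_line(2.0, 4))
```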

f(x,y) = (x^2 - y^2)/(x^2 + y^2) as (x,y) -> (0,0).
- Along x-axis (y=0): x^2/x^2 = 1
- Along y-axis (x=0): -y^2/y^2 = -1

1 != -1 => the limit does not exist. Geometric sense: this function is constant on rays from the origin but takes different values on different rays.
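The geometric remark can be made precise in polar coordinates. Assuming the function in this problem is $f(x,y) = (x^2 - y^2)/(x^2 + y^2)$, which matches both axis computations, substitute $x = r\cos\theta$, $y = r\sin\theta$ with $r > 0$:

```latex
\begin{align*}
f(r\cos\theta,\, r\sin\theta)
  &= \frac{r^2\cos^2\theta - r^2\sin^2\theta}{r^2\cos^2\theta + r^2\sin^2\theta} \\
  &= \cos^2\theta - \sin^2\theta \\
  &= \cos 2\theta .
\end{align*}
```

The result does not depend on $r$, so $f$ is constant along each ray. Taking $\theta = 0$ gives $1$ and $\theta = \pi/2$ gives $-1$, in agreement with the axis computations, while $\theta = \pi/4$ gives $0$.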

Where This Topic Leads

Functions of several variables are the foundation of everything ahead in ML:

Partial Derivatives — How fast does the loss change along each parameter?
Gradient and Gradient Descent — Direction of steepest loss decrease - the foundation of neural network training
Optimization — Finding the minimum of loss in a space of millions of parameters
Taylor Series — Multivariable Taylor: quadratic loss approximation is the basis of L-BFGS

Summary

A function $f: \mathbb{R}^n \to \mathbb{R}$ takes multiple inputs, returns one output; neural network loss is exactly this
The graph of $f(x,y)$ is a surface: paraboloid (bowl), saddle - the key shapes in optimization
Level curves are 'slices' of the surface; dense curves mean large gradient and unstable training
The gradient $\nabla f$ is perpendicular to level curves and points in the direction of steepest ascent
A limit exists only when every path to the point gives the same value

Questions for Reflection

Why are saddle points harder for gradient descent than local minima?
How did Li et al. visualize ResNet's loss landscape in 2D?
Why is gradient clipping applied at steep loss slopes?
If the limit along every straight line is the same, does the limit necessarily exist?

Related Lessons

stats-21