Calculus

Functions of Several Variables

Lesson Objectives

  • Understand why functions of many variables are central to ML and optimization
  • Interpret surfaces, paraboloids, and saddle points
  • Read level curves as a tool for loss landscape analysis
  • Determine when a multivariable limit exists and when it does not

Prerequisites

  • What a function is (input -> output)
  • Limits - where a function tends
  • The Concept of a Limit
  • Continuity of a Function

GPT-4's loss function reportedly takes about 1.8 trillion parameters (an unofficial estimate) and returns one number. Each Adam step updates those parameters using the gradient of that single number. That is multivariable calculus in production, at a scale that would have seemed impossible 20 years ago.

  • GPT-4: L(theta) with theta in R^{1.8T} - a function of roughly 1.8 trillion variables (reported estimate)
  • ResNet-50: backward pass computes about 25M partial derivatives, one per parameter, in one pass
  • Adam optimizer: gradient descent with momentum, moving across the loss surface in parameter space
  • Li et al. 2018: loss landscape of ResNet visualized as a 3D surface
  • Kaggle: hyperparameter tuning = minimizing f(lr, dropout, batch_size, ...)

GPT-4 and 1.8 Trillion Variables

Taking the reported figure at face value, the GPT-4 loss function is $L(\theta)$ where $\theta \in \mathbb{R}^{1.8\text{T}}$. The gradient $\nabla L$ is a vector of 1.8 trillion numbers, one partial derivative per parameter, and each Adam step uses it to update the parameters. That is a function of several variables at a scale unimaginable 20 years ago.

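ResNet-50 processes an input image of $224 \times 224 \times 3 = 150{,}528$ values (pixels times color channels), so the network is a function of 150,528 input variables. Its loss is in turn a function of about 25 million parameters, and the backward pass computes all 25 million partial derivatives in a single pass.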
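The "many inputs, one output" idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical linear model and a made-up two-example dataset; `mse_loss`, `inputs`, and `targets` are illustrative names and have nothing to do with how GPT-4 or ResNet-50 are implemented.

```python
def mse_loss(theta, inputs, targets):
    """Map a parameter vector theta (a list of floats) to ONE number."""
    total = 0.0
    for x, y in zip(inputs, targets):
        # Linear model: prediction is the dot product of theta and x.
        prediction = sum(t * xi for t, xi in zip(theta, x))
        total += (prediction - y) ** 2
    return total / len(targets)


# Made-up data: two examples with three features each.
inputs = [(1.0, 2.0, 3.0), (0.0, 1.0, 0.0)]
targets = [14.0, 2.0]

perfect = mse_loss([1.0, 2.0, 3.0], inputs, targets)  # fits both examples
zeros = mse_loss([0.0, 0.0, 0.0], inputs, targets)    # predicts 0 everywhere
print(perfect, zeros)  # 0.0 100.0
```

Here `theta` has three entries, so `mse_loss` is a function of three variables; a real network differs in the size of `theta`, not in the shape of the mapping.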

Which situation does NOT require a function of several variables?

Temperature conversion is a function of ONE variable: $F = 1.8C + 32$. All others require two or more inputs.

Surfaces: The Loss Landscape of Neural Networks

The graph of $f(x,y)$ is a surface in 3D. The loss surface of a neural network with two parameters looks like a mountain landscape with valleys (local minima), passes (saddle points), and peaks. Li et al. (2018) literally rendered such a landscape for ResNet-56.

Paraboloid f(x,y) = x^2 + y^2:
  • f(0,0) = 0 (minimum)
  • f(1,0) = 1, f(0,2) = 4, f(1,1) = 2
  • Shape: satellite dish / bowl-shaped loss

Saddle f(x,y) = x^2 - y^2:
  • f(0,0) = 0 (not min, not max)
  • Along x: x^2 increases -> minimum
  • Along y: -y^2 decreases -> maximum
  • (0,0) is a saddle point
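The two shapes can be checked numerically by slicing each surface along the coordinate axes. This is a small sketch; the function and variable names are chosen here for illustration.

```python
def paraboloid(x, y):
    return x**2 + y**2


def saddle(x, y):
    return x**2 - y**2


ts = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Slice the saddle along each axis through the origin.
along_x = [saddle(t, 0.0) for t in ts]  # x^2: smallest at t = 0
along_y = [saddle(0.0, t) for t in ts]  # -y^2: largest at t = 0

print(along_x)  # [1.0, 0.25, 0.0, 0.25, 1.0]
print(along_y)  # [-1.0, -0.25, 0.0, -0.25, -1.0]
print(paraboloid(1.0, 0.0), paraboloid(0.0, 2.0), paraboloid(1.0, 1.0))  # 1.0 4.0 2.0
```

The origin is the lowest point of the x-slice and the highest point of the y-slice at the same time, which is exactly what makes it a saddle point.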

Saddle points are a major challenge in neural network optimization: the gradient is zero there, just as at a minimum, but the point is not a minimum, so plain gradient descent can slow to a crawl nearby. Momentum-based optimizers such as Adam tend to get past saddle points more easily.
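A toy experiment shows how a saddle traps or releases an optimizer. The sketch below uses plain gradient descent (not Adam) on the saddle $f(x,y) = x^2 - y^2$; the learning rate, step count, and starting points are arbitrary choices for illustration.

```python
def grad_saddle(x, y):
    """Gradient of f(x, y) = x^2 - y^2."""
    return 2 * x, -2 * y


def descend(x, y, lr=0.1, steps=100):
    """Plain gradient descent: step against the gradient."""
    for _ in range(steps):
        gx, gy = grad_saddle(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Start exactly on the x-axis: y never changes, x shrinks toward 0.
stuck = descend(1.0, 0.0)

# Start a hair off the axis: y is multiplied by 1.2 every step and runs away.
escaped = descend(1.0, 1e-6)

print(stuck)    # x is tiny, y is exactly 0.0 -> sitting at the saddle
print(escaped)  # x is tiny, y has grown large -> left the saddle
```

Because this toy function is unbounded below, "escaping" here means sliding down the $-y^2$ direction indefinitely; on a real loss surface the descent would continue toward some lower region instead.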

Summarise the key idea of: Surfaces and Loss Landscape.

Level Curves: Reading the Loss Landscape

TensorBoard shows loss curves, which are functions of one variable (training step). But the loss landscape is a multidimensional surface. Li et al. (2018) restrict it to a 2D plane spanned by two random directions in parameter space and draw level curves of the loss over that plane. Dense contours mean steep slopes, which tend to go with unstable training.

Level curve c: all points where x^2 + y^2 = c. This is a circle of radius sqrt(c).
  • c = 1: r = 1.0 (unit circle)
  • c = 4: r = 2.0
  • c = 9: r = 3.0
Level curves = concentric circles. For equally spaced values of c, the circles get denser farther from the center, where the slope is steeper.
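To see where the contours of $x^2 + y^2$ bunch up, compute the radii for equally spaced levels and look at the gaps between neighboring circles. This is a small sketch; the choice of levels 1 through 5 is arbitrary.

```python
import math

levels = [1, 2, 3, 4, 5]                  # equally spaced values of c
radii = [math.sqrt(c) for c in levels]    # level curve c is a circle of radius sqrt(c)

# Distance between neighboring circles.
gaps = [outer - inner for inner, outer in zip(radii, radii[1:])]

for c, r in zip(levels, radii):
    print(f"c = {c}: r = {r:.3f}")
print([round(g, 3) for g in gaps])  # [0.414, 0.318, 0.268, 0.236]
```

The gaps shrink as the radius grows: the same rise in $f$ takes less and less horizontal distance, which matches the gradient magnitude $2r$ growing with distance from the center.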

The gradient is always perpendicular to level curves. Gradient descent steps in the direction opposite to the gradient, along this perpendicular, which is why each step follows the locally steepest path downhill.
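The perpendicularity can be checked numerically for $f(x,y) = x^2 + y^2$: pick a point on a level circle, take the tangent direction of the circle there, and compute its dot product with the gradient. The radius and angle below are arbitrary, and the helper names are illustrative.

```python
import math


def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + y^2."""
    return 2 * x, 2 * y


def tangent_on_circle(r, angle):
    """Tangent direction of the circle of radius r at the given angle."""
    return -r * math.sin(angle), r * math.cos(angle)


r, angle = 2.0, 0.7                         # a point on the level curve c = 4
x, y = r * math.cos(angle), r * math.sin(angle)

gx, gy = grad_f(x, y)
tx, ty = tangent_on_circle(r, angle)

dot = gx * tx + gy * ty                     # zero means perpendicular
print(dot)
```

A dot product of zero (up to rounding) confirms that moving along the level curve does not change $f$ to first order, while moving along the gradient changes it fastest.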

Dense level curves appear on a loss landscape visualization. What does this indicate?

Dense level curves mean a large change over a small distance, which means a large gradient. With a fixed learning rate, a large gradient produces large steps that can make gradient descent unstable, which is one reason gradient clipping is used on steep slopes.
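Gradient clipping by norm can be sketched as follows. This is a generic illustration under stated assumptions, not any particular library's API; `clip_by_norm` and `max_norm` are names chosen here.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm.

    The direction is preserved; only the length is capped.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


steep = clip_by_norm([30.0, 40.0], max_norm=5.0)   # norm 50 -> scaled down to 5
gentle = clip_by_norm([0.3, 0.4], max_norm=5.0)    # norm 0.5 -> unchanged

print(steep)   # [3.0, 4.0]
print(gentle)  # [0.3, 0.4]
```

On a steep slope the step keeps its direction but loses its excessive length, so one bad gradient cannot throw the parameters far across the landscape.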

Limits: Path Matters

In one dimension a point can be approached from only two directions: left or right. In two dimensions there are **infinitely many paths**: along any line, parabola, or spiral. A limit exists only if every path gives the same value.

Does f(x,y) = x*y/(x^2 + y^2) have a limit as (x,y) -> (0,0)?

Along x-axis (y=0): x*0/(x^2+0) = 0
Along y-axis (x=0): 0*y/(0+y^2) = 0
Seems like the limit is 0. But:
Along y=x: x*x/(x^2+x^2) = x^2/(2x^2) = 1/2

Axes give 0, diagonal gives 1/2. The limit does not exist.
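The path test can be reproduced numerically. The sketch assumes the function under study is $f(x,y) = xy/(x^2+y^2)$, read off from the path computations, and evaluates it at points approaching the origin along three paths; the path names and sample values of `t` are illustrative.

```python
def f(x, y):
    return x * y / (x * x + y * y)


# Each path maps a small parameter t to a point near (0, 0).
paths = {
    "x-axis": lambda t: (t, 0.0),
    "y-axis": lambda t: (0.0, t),
    "diagonal y=x": lambda t: (t, t),
}

results = {}
for name, path in paths.items():
    results[name] = [f(*path(t)) for t in (1e-1, 1e-3, 1e-6)]
    print(name, results[name])
```

The values stay at 0 along both axes and at 0.5 along the diagonal no matter how small `t` gets, so no single number can serve as the limit.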

Even if every straight-line approach gives the same answer, the limit may still fail to exist - curved paths must also be checked. In ML this explains why a loss surface may look smooth along one direction yet behave unexpectedly in another.
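A standard textbook function (not taken from this lesson) shows why straight lines are not enough: $g(x,y) = x^2 y/(x^4 + y^2)$. Along any line $y = mx$ it equals $mx/(x^2 + m^2)$, which tends to 0, yet along the parabola $y = x^2$ it is constantly 1/2. The sketch below checks this numerically; the slopes and sample points are arbitrary.

```python
def g(x, y):
    return x * x * y / (x**4 + y * y)


small_ts = (1e-2, 1e-4, 1e-6)

# Straight lines y = m*x through the origin: values shrink toward 0.
line_values = {m: [g(t, m * t) for t in small_ts] for m in (0.5, 1.0, 2.0)}

# Curved path y = x^2: the value stays at 1/2.
parabola_values = [g(t, t * t) for t in small_ts]

for m, values in line_values.items():
    print(f"y = {m}x:", values)
print("y = x^2:", parabola_values)
```

Every line says the limit is 0, the parabola says 1/2, so the limit at the origin does not exist even though all straight-line approaches agree.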

Why are multivariable limits harder than one-dimensional limits?

On the line - two paths. On the plane - infinitely many. Every path must give the same answer for the limit to exist.

Practice

Three problems: from concrete computation to limit analysis.

Problem 1. L(w, b) = (w*x + b - y)^2 at x=2, y=3
  • L(1.0, 0.0) = (1*2 + 0 - 3)^2 = (-1)^2 = 1.0
  • L(1.5, 0.0) = (1.5*2 + 0 - 3)^2 = 0^2 = 0.0 (optimal!)
  • L(1.0, 1.0) = (1*2 + 1 - 3)^2 = 0^2 = 0.0 (also optimal)
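The evaluations above can be reproduced in code, and the same function shows why two different parameter pairs are both optimal: with x=2 and y=3 the loss is zero whenever 2w + b = 3. This is a minimal sketch; the default arguments and the list of sample `w` values are choices made here.

```python
def loss(w, b, x=2.0, y=3.0):
    """Squared error of the prediction w*x + b on a single example (x, y)."""
    return (w * x + b - y) ** 2


print(loss(1.0, 0.0))  # 1.0
print(loss(1.5, 0.0))  # 0.0
print(loss(1.0, 1.0))  # 0.0

# Every point on the line b = 3 - 2w gives zero loss.
on_line = [loss(w, 3.0 - 2.0 * w) for w in (0.0, 0.5, 1.0, 1.5, 2.0)]
print(on_line)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The zero level curve of this loss is a whole line in the (w, b) plane, so one training example cannot pin down two parameters.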

Problem 2. Level curves of f(x,y) = x + 2y.
Level curve c: x + 2y = c, or y = (c-x)/2. These are lines with slope -1/2.
  • c = 0: line y = -x/2
  • c = 2: line y = 1 - x/2
  • c = -2: line y = -1 - x/2
Parallel lines. Surface z = x+2y is a tilted plane. Gradient: (1, 2) - perpendicular to every level curve.
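The perpendicularity claim can be verified with one dot product. The direction vector $\mathbf{d}$ is notation introduced here: every line $x + 2y = c$ runs along $\mathbf{d} = (2, -1)$, since moving by 2 in $x$ and by $-1$ in $y$ leaves $x + 2y$ unchanged.

$$\nabla f \cdot \mathbf{d} = (1, 2) \cdot (2, -1) = 1 \cdot 2 + 2 \cdot (-1) = 0$$

A zero dot product means the gradient is perpendicular to the line, and because $\mathbf{d}$ does not depend on $c$, this holds for every level curve at once.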

Problem 3. Does f(x,y) = (x^2 - y^2)/(x^2 + y^2) have a limit as (x,y) -> (0,0)?
Along x-axis (y=0): x^2/x^2 = 1
Along y-axis (x=0): -y^2/y^2 = -1
1 != -1 => the limit does not exist.
Geometric sense: this function is constant on rays from the origin but takes different values on different rays.
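The geometric remark becomes explicit in polar coordinates. Assuming the function is $f(x,y) = (x^2 - y^2)/(x^2 + y^2)$ (inferred from the path computations) and writing $x = r\cos\theta$, $y = r\sin\theta$ with $r > 0$:

$$f = \frac{r^2\cos^2\theta - r^2\sin^2\theta}{r^2\cos^2\theta + r^2\sin^2\theta} = \cos^2\theta - \sin^2\theta = \cos 2\theta$$

The radius $r$ cancels, so the value depends only on the direction $\theta$: the x-axis ($\theta = 0$) gives $1$, the y-axis ($\theta = \pi/2$) gives $-1$, and shrinking $r$ toward 0 never brings these values together.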

Summarise the key idea of: Practice.

Where This Topic Leads

Functions of several variables are the foundation of everything ahead in ML:

  • Partial Derivatives — How fast does the loss change along each parameter?
  • Gradient and Gradient Descent — Direction of steepest loss decrease - the foundation of neural network training
  • Optimization — Finding the minimum of loss in a space of millions of parameters
  • Taylor Series — Multivariable Taylor: quadratic loss approximation is the basis of L-BFGS

Summary

  • A function $f: \mathbb{R}^n \to \mathbb{R}$ takes multiple inputs, returns one output; neural network loss is exactly this
  • The graph of $f(x,y)$ is a surface: paraboloid (bowl), saddle - the key shapes in optimization
  • Level curves are 'slices' of the surface; dense curves mean large gradient and unstable training
  • The gradient $\nabla f$ is perpendicular to level curves and points in the direction of steepest ascent
  • A limit exists only when every path to the point gives the same value

Questions for Reflection

  • Why are saddle points harder for gradient descent than local minima?
  • How did Li et al. visualize ResNet's loss landscape in 2D?
  • Why is gradient clipping applied at steep loss slopes?
  • If the limit along every straight line is the same, does the limit necessarily exist?

Related Lessons

  • stats-21