Calculus

The Concept of the Derivative

Lesson objectives

  • Understand the derivative as a limit of an incremental ratio
  • Interpret the derivative geometrically (tangent line) and physically (velocity)
  • Compute derivatives from the definition
  • Distinguish differentiability from continuity
  • Understand why ReLU is non-differentiable at zero and what that implies

Prerequisites

  • The concept of a function limit
  • Computing limits
  • Continuity of a function

Adam, Adagrad, RMSProp, and SGD all share one requirement: they need the gradient. The gradient is a vector of partial derivatives. The derivative, the limit of an incremental ratio, first appeared in Newton's notebooks around 1665, in work motivated by problems of motion. The underlying idea has not changed since: PyTorch autograd computes the same derivatives, only automatically and at scale.

  • **Gradient descent**: every weight update $w \leftarrow w - \alpha \nabla L$ is a step against the gradient of the loss function. Without the derivative, gradient-based optimization is impossible
  • **Autograd**: PyTorch builds a computation graph and applies the chain rule to the known derivatives of elementary operations, each of which follows from the definition
  • **ReLU**: $\text{ReLU}(x) = \max(0, x)$ is not differentiable at $x = 0$ (a corner point, like $|x|$). In practice a subgradient is used: $f'(0) = 0$ or $f'(0) = 1$
  • **Learning rate**: too large - the gradient 'overshoots', too small - convergence is slow. The derivative indicates the direction, the learning rate controls the step size
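
The update rule and the role of the learning rate from the list above can be sketched in plain Python. The quadratic loss $L(w) = (w - 3)^2$, the helper names `loss_grad` and `gradient_descent`, and the three learning rates are illustrative assumptions, not part of the lesson.

```python
def loss_grad(w):
    """Derivative of the illustrative loss L(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, lr, steps):
    """Apply w <- w - lr * L'(w) a fixed number of times."""
    w = w0
    for _ in range(steps):
        w = w - lr * loss_grad(w)
    return w


w_small = gradient_descent(w0=0.0, lr=0.01, steps=50)  # slow progress
w_good = gradient_descent(w0=0.0, lr=0.1, steps=50)    # close to the minimum at 3
w_large = gradient_descent(w0=0.0, lr=1.1, steps=50)   # overshoots and diverges

print(w_small, w_good, w_large)
```

For this loss each step multiplies the distance to the minimum by $1 - 2\alpha$, which is why a rate above 1 makes the iterates grow instead of shrink.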

Two geniuses, one dispute

**Isaac Newton** created his 'method of fluxions' around 1665 to solve problems in mechanics, in particular the description of motion. **Gottfried Leibniz** independently developed calculus and published it in 1684 with more convenient notation ($\frac{dy}{dx}$). Their followers staged a years-long dispute over priority, dividing mathematicians of England and the Continent. Everyone won: we have both notations today.

Definition of the derivative

The derivative of $f$ at $x$ is the **limit of the ratio of the function's increment to the argument's increment**:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

The fraction $\frac{f(x+h) - f(x)}{h}$ is the **difference quotient** - the average rate of change over $[x, x+h]$. Taking $h \to 0$ gives the **instantaneous** rate of change.

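The derivative exists if this limit **exists and is finite**. Then the function is called **differentiable** at $x$. PyTorch autograd does not check this condition: it applies a known derivative formula to each operation in the computation graph, so points of non-differentiability are handled by convention.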
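
The limit in the definition can be watched numerically by shrinking $h$. This is a minimal sketch: the helper name `difference_quotient`, the test function $x^2$, and the point $x = 3$ are illustrative choices.

```python
def difference_quotient(f, x, h):
    """Average rate of change of f over [x, x + h]."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# For f(x) = x**2 at x = 3 the quotient equals 6 + h, so it approaches 6.
quotients = [difference_quotient(square, 3.0, 10.0 ** -k) for k in range(1, 6)]
for k, q in enumerate(quotients, start=1):
    print(f"h = 1e-{k}: {q:.6f}")
```

In floating-point arithmetic $h$ cannot be made arbitrarily small: below roughly $10^{-8}$ rounding error starts to dominate the quotient.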

What does the difference quotient $\frac{f(x+h) - f(x)}{h}$ represent?

The difference quotient is the average rate over an interval. The derivative - the instantaneous rate - is obtained as the limit when $h \to 0$.

Geometric meaning

The difference quotient is the **slope of the secant line** through $(x, f(x))$ and $(x+h, f(x+h))$. As $h \to 0$ the secant approaches the **tangent line**, whose slope is $f'(x)$.

Equation of the tangent line

Through the point of tangency and the slope

The tangent to $f(x)$ at point $a$:

$$y = f(a) + f'(a)(x - a)$$

**Example**: tangent to $y = x^2$ at $x = 2$:

  • $f(2) = 4$, so the tangent point is $(2, 4)$
  • $f'(x) = 2x$, so the slope is $f'(2) = 4$

Equation: $y = 4 + 4(x - 2) = 4x - 4$

In ML: the tangent to the loss at the current weights is the linear approximation that gradient descent uses to take a step.
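
The tangent formula can be turned into a small helper to see how well the line approximates the curve near the point of tangency. The names `tangent_line`, `parabola`, and `d_parabola` are hypothetical, introduced only for this sketch.

```python
def tangent_line(f, df, a):
    """Return the function y(x) = f(a) + f'(a) * (x - a)."""
    fa, slope = f(a), df(a)

    def line(x):
        return fa + slope * (x - a)

    return line


def parabola(x):
    return x ** 2


def d_parabola(x):
    return 2 * x  # derivative of x**2, as in the example above


tangent = tangent_line(parabola, d_parabola, 2.0)

# Near x = 2 the tangent tracks the parabola; the gap is (x - 2)**2.
for x in (2.0, 2.1, 2.5):
    print(x, parabola(x), tangent(x))
```

The gap grows quadratically with the distance from $x = 2$, which is why a linear approximation is only trusted for small steps.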

Myth: the tangent line intersects the graph at only one point

Fact: the tangent line may intersect the graph at other points too

The tangent is defined via the limit of secants, not by the number of intersections. The tangent to $y = x^3$ at $x = 1$ is $y = 3x - 2$, which meets the graph again at $x = -2$. A tangent can also cross the graph at the point of tangency itself: the tangent to $y = x^3$ at 0 is the $x$-axis.
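
A quick check that a tangent can meet the graph at a second point. The choice of $y = x^3$ with tangency at $x = 1$ is an illustrative assumption; its tangent there is $y = 1 + 3(x - 1) = 3x - 2$.

```python
def cube(x):
    return x ** 3


def tangent_at_1(x):
    # Tangent to x**3 at x = 1: value 1, slope 3 * 1**2 = 3.
    return 1 + 3 * (x - 1)


# The gap x**3 - (3x - 2) factors as (x - 1)**2 * (x + 2),
# so the line touches at x = 1 and crosses the graph again at x = -2.
gap_at_touch = cube(1) - tangent_at_1(1)
gap_at_second = cube(-2) - tangent_at_1(-2)
print(gap_at_touch, gap_at_second)
```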

Geometrically, what does $f'(a)$ represent?

The derivative is the limit of secant slopes as the second point approaches $(a, f(a))$ - the tangent slope.

Physical meaning

If $s(t)$ is the position of an object at time $t$, then:

  • $v(t) = s'(t)$ - **instantaneous velocity**
  • $a(t) = v'(t) = s''(t)$ - **acceleration** (second derivative)

Free fall

s(t) = 4.9t²

$s(t) = 4.9t^2$ meters.

**Velocity**: $v(t) = s'(t) = 9.8t$ m/s

**Acceleration**: $a(t) = v'(t) = 9.8$ m/s² (constant: this is $g$!)

After 3 seconds: $s(3) = 44.1$ m, $v(3) = 29.4$ m/s.
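
The free-fall numbers can be reproduced without differentiation formulas by applying a symmetric difference quotient twice. This is a numerical sketch; the helper `central_diff` and its step size are assumptions chosen for illustration.

```python
def position(t):
    """Free-fall distance in meters, s(t) = 4.9 * t**2."""
    return 4.9 * t ** 2


def central_diff(f, t, h=1e-4):
    """Symmetric difference quotient, an approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)


def velocity(t):
    return central_diff(position, t)


def acceleration(t):
    return central_diff(velocity, t)


print(position(3.0), velocity(3.0), acceleration(3.0))
```

The printed values should be close to 44.1 m, 29.4 m/s, and 9.8 m/s², up to floating-point rounding.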

If the velocity of a car is $v(t) = 20 + 2t$ m/s, what is the acceleration?

Acceleration is the derivative of velocity: $a = v'(t) = (20 + 2t)' = 2$ m/s². Speed increases by 2 m/s every second.

Computing from the definition

Derivative of x²

The classic first example

Find $(x^2)'$ from the definition:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}$$

Expand:

$$= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h}$$

Cancel $h$:

$$= \lim_{h \to 0} (2x + h) = 2x$$

$(x^2)' = 2x$. The slope of the parabola grows linearly with $x$.

Derivative of 1/x

Negative power

$f(x) = \frac{1}{x}$, $x \neq 0$:

$$f'(x) = \lim_{h \to 0} \frac{\frac{1}{x+h} - \frac{1}{x}}{h} = \lim_{h \to 0} \frac{x - (x+h)}{h \cdot x(x+h)}$$

The numerator is $-h$, so $h$ cancels:

$$= \lim_{h \to 0} \frac{-1}{x(x+h)} = -\frac{1}{x^2}$$

$(1/x)' = -1/x^2$. The function decreases on each side of zero, so the derivative is negative.
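
The algebra above can be sanity-checked numerically: for any nonzero $h$ the raw difference quotient of $1/x$ should agree with the simplified expression $-\frac{1}{x(x+h)}$. The point $x = 2$ and the step sizes are arbitrary illustrative choices.

```python
def reciprocal(x):
    return 1.0 / x


x = 2.0
for h in (0.1, 0.01, 0.001):
    quotient = (reciprocal(x + h) - reciprocal(x)) / h
    simplified = -1.0 / (x * (x + h))  # the expression after cancelling h
    print(h, quotient, simplified)

limit_value = -1.0 / x ** 2  # value of -1/x**2 at x = 2
print(limit_value)
```

Both columns move toward `limit_value` as $h$ shrinks, matching $(1/x)' = -1/x^2$.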

When computing $(x^3)'$ from the definition, what expression needs to be taken as a limit?

By definition: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$. For $f(x) = x^3$ this is $\frac{(x+h)^3 - x^3}{h}$.

Table of derivatives

These derivatives must be **memorized** - they are the foundation of all differentiation:

| $f(x)$ | $f'(x)$ | Comment |
|---|---|---|
| $c$ (constant) | $0$ | A constant does not change |
| $x^n$ | $nx^{n-1}$ | Power rule |
| $e^x$ | $e^x$ | Unique: equals itself! |
| $a^x$ | $a^x \ln a$ | Exponential function |
| $\ln x$ | $\frac{1}{x}$ | Natural logarithm |
| $\log_a x$ | $\frac{1}{x \ln a}$ | Logarithm to base $a$ |
| $\sin x$ | $\cos x$ | Shift by $\pi/2$ |
| $\cos x$ | $-\sin x$ | Minus! |
| $\tan x$ | $\frac{1}{\cos^2 x}$ | Or $1 + \tan^2 x$ |
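
The table entries can be spot-checked numerically. The sketch below compares each formula with a symmetric difference quotient at one arbitrary point; `central_diff`, the test point $x = 0.7$, and the sample base $a = 2$ are assumptions made for illustration.

```python
import math


def central_diff(f, x, h=1e-6):
    """Symmetric difference quotient, an approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 0.7  # arbitrary test point where every function below is defined

# name -> (function, derivative from the table)
table = {
    "x^3": (lambda t: t ** 3, lambda t: 3 * t ** 2),
    "e^x": (math.exp, math.exp),
    "2^x": (lambda t: 2 ** t, lambda t: 2 ** t * math.log(2)),
    "ln x": (math.log, lambda t: 1 / t),
    "sin x": (math.sin, math.cos),
    "cos x": (math.cos, lambda t: -math.sin(t)),
    "tan x": (math.tan, lambda t: 1 / math.cos(t) ** 2),
}

errors = {name: abs(central_diff(f, x) - df(x)) for name, (f, df) in table.items()}
for name, err in errors.items():
    print(f"{name:6s} error = {err:.2e}")
```

A check at one point does not prove a formula, but a large error would immediately expose a misremembered sign or factor.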

**$e^x$ equals its own derivative**, and this is not a coincidence: it is the solution of $y' = y$ with $y(0) = 1$. That's why $e^x$ appears in solutions to differential equations, and its simple derivative is one reason softmax is built on $e^x$.

What is the derivative of $\ln(x)$?

Standard derivative: $\frac{d}{dx}\ln(x) = 1/x$ for $x > 0$.

Derivative notation

| Notation | Author | When used |
|---|---|---|
| $f'(x)$, $y'$ | Lagrange | Everywhere, convenient for functions |
| $\frac{df}{dx}$, $\frac{dy}{dx}$ | Leibniz | Emphasizes 'with respect to what' |
| $\dot{y}$, $\ddot{y}$ | Newton | Physics (derivative with respect to time) |
| $D_x f$, $Df$ | Euler (operator notation) | Functional analysis |

Leibniz notation $\frac{dy}{dx}$ became the ML standard: $\frac{\partial L}{\partial w}$, the gradient of the loss with respect to the weights, is Leibniz notation in its partial-derivative form.

Which notations all mean the derivative of $y = f(x)$?

All four notations (Lagrange, Leibniz, Newton, Euler) denote the same derivative. Each has historical and contextual usage.

Differentiability vs continuity

A function is **differentiable** at a point if the derivative exists there (the limit exists and is finite). Key relationship:

$$f \text{ differentiable at } a \implies f \text{ continuous at } a$$

**The converse is false.** A function can be continuous but not differentiable - that's exactly how ReLU behaves at 0.

ReLU and |x|

Continuous but not differentiable at zero

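$f(x) = |x|$ is continuous everywhere. At $x = 0$ the one-sided limits of the difference quotient are:

  • From the left: $\lim_{h \to 0^-} \frac{|h|}{h} = \lim_{h \to 0^-} \frac{-h}{h} = -1$
  • From the right: $\lim_{h \to 0^+} \frac{|h|}{h} = \lim_{h \to 0^+} \frac{h}{h} = +1$

The limits **differ**, so there is no derivative: a corner point.

$\text{ReLU}(x) = \max(0, x)$ has the same problem at $x = 0$, where the one-sided slopes are 0 and 1. Frameworks pick a subgradient there by convention; PyTorch returns 0. Training works in practice because a single point has measure zero: inputs rarely land exactly on it, and when they do the convention supplies a value.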
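
The mismatch of one-sided slopes can be seen directly by evaluating the difference quotient on each side of zero. This is a minimal sketch; `relu` and `one_sided_quotients` are hypothetical helper names, and the step $h = 10^{-6}$ is arbitrary.

```python
def relu(x):
    return max(0.0, x)


def one_sided_quotients(f, h=1e-6):
    """Difference quotients of f at 0 from the left and from the right."""
    left = (f(0.0) - f(-h)) / h   # slope over [-h, 0]
    right = (f(h) - f(0.0)) / h   # slope over [0, h]
    return left, right


abs_left, abs_right = one_sided_quotients(abs)
relu_left, relu_right = one_sided_quotients(relu)

print(abs_left, abs_right)    # slopes of |x| on each side of 0
print(relu_left, relu_right)  # slopes of ReLU on each side of 0
```

Because both functions are piecewise linear, the two slopes stay fixed for every positive $h$, so they cannot converge to a common limit.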

Myth: if a function is continuous, it is differentiable

Fact: continuity is necessary but not sufficient for differentiability

ReLU, $|x|$, and $\sqrt[3]{x}$ are all continuous at zero but not differentiable there. In 1872 Weierstrass constructed a function that is continuous everywhere and differentiable nowhere.

Which statement is true?

$|x|$ is continuous at 0 but not differentiable there. Differentiability requires a unique tangent, a stronger condition than continuity.

Practice

Compute $(x^3)'$ from the definition of the derivative

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^3 - x^3}{h}$$

Expand $(x+h)^3 = x^3 + 3x^2h + 3xh^2 + h^3$:

$$= \lim_{h \to 0} \frac{3x^2h + 3xh^2 + h^3}{h} = \lim_{h \to 0} (3x^2 + 3xh + h^2) = 3x^2$$

$(x^3)' = 3x^2$, which confirms the formula $nx^{n-1}$.

Find the equation of the tangent line to $y = \sqrt{x}$ at $x = 4$

Point of tangency: $f(4) = 2$, point $(4, 2)$.

Slope: $f'(x) = \frac{1}{2\sqrt{x}}$, so $f'(4) = \frac{1}{2\sqrt{4}} = \frac{1}{4}$.

Tangent line equation:

$$y = 2 + \frac{1}{4}(x - 4) = \frac{1}{4}x + 1$$
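
The tangent found above doubles as a cheap approximation of $\sqrt{x}$ near $x = 4$. The sketch below compares the two; the function name `tangent_sqrt_at_4` and the sample points are illustrative assumptions.

```python
import math


def tangent_sqrt_at_4(x):
    """Tangent to sqrt(x) at x = 4: y = x / 4 + 1."""
    return 0.25 * x + 1.0


for x in (4.0, 4.1, 5.0):
    exact = math.sqrt(x)
    approx = tangent_sqrt_at_4(x)
    print(f"x = {x}: sqrt = {exact:.5f}, tangent = {approx:.5f}")
```

The tangent lies above the curve away from $x = 4$ because $\sqrt{x}$ is concave, and the error grows with the distance from the point of tangency.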

Prove that $f(x) = x|x|$ is differentiable at $x = 0$

$$f(x) = \begin{cases} x^2 & x \geq 0 \\ -x^2 & x < 0 \end{cases}$$

**From the right**: $\lim_{h \to 0^+} \frac{h^2}{h} = \lim_{h \to 0^+} h = 0$

**From the left**: $\lim_{h \to 0^-} \frac{-h^2}{h} = \lim_{h \to 0^-} (-h) = 0$

The limits **coincide**: $f'(0) = 0$, so the function is differentiable.

Unlike $|x|$, the function $x|x|$ 'smooths out' the corner. Analogy: $\text{SiLU}(x) = x \cdot \sigma(x)$, a smooth product.
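
The same one-sided check that fails for $|x|$ succeeds here: both quotients shrink to zero together. This is a numerical illustration under arbitrary step sizes, not a substitute for the proof above.

```python
def f(x):
    return x * abs(x)


for h in (1e-1, 1e-3, 1e-5):
    right = (f(h) - f(0.0)) / h   # slope over [0, h], equals h
    left = (f(0.0) - f(-h)) / h   # slope over [-h, 0], also equals h
    print(h, left, right)
```

Both columns track $h$ itself, so they share the limit 0 as $h \to 0$.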

Compute $f'(2)$ for $f(x) = x^3$.

$f'(x) = 3x^2$, so $f'(2) = 3 \cdot 4 = 12$.

Connection with other topics

The derivative is the central concept of calculus

  • Differentiation rules: the next lesson shows how to compute quickly without returning to the definition
  • Continuity: differentiability implies continuity, but ReLU shows the converse fails
  • Integral: integration is the inverse operation of differentiation (Fundamental Theorem of Calculus)
  • Optimization: the derivative is 0 at interior extrema of a differentiable function, which is the stationary condition gradient descent moves toward

Summary

  • Derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ - instantaneous rate of change
  • Geometrically: slope of the tangent line. In ML: direction of gradient descent
  • Physically: velocity (first derivative), acceleration (second)
  • Differentiability implies continuity (but not conversely - ReLU is the example)
  • Table: $(x^n)' = nx^{n-1}$, $(e^x)' = e^x$, $(\sin x)' = \cos x$

Questions for reflection

  • Why is $e^x$, up to a constant factor, the only function equal to its own derivative? How does this connect to solving the ODE $y' = y$?
  • ReLU is not differentiable at zero, yet PyTorch trains neural networks with ReLU. Why does this work?
  • What would happen to gradient descent if the loss function had a discontinuity?

Related lessons

  • ml-09-gradient-descent
  • stat-03-mle