Calculus
The Concept of the Derivative
Lesson objectives
- Understand the derivative as the limit of a difference quotient
- Interpret the derivative geometrically (tangent line) and physically (velocity)
- Compute derivatives from the definition
- Distinguish differentiability from continuity
- Understand why ReLU is non-differentiable at zero and what that implies
Prerequisites
- The concept of a function limit
- Computing limits
- Continuity of a function
Adam, Adagrad, RMSProp, SGD: all neural network optimizers share one thing, they need the gradient. The gradient is a vector of partial derivatives. The derivative, the limit of a difference quotient, first appeared in Newton's notebooks around 1665, while he was working on problems of motion. The idea has not changed since: PyTorch autograd computes the same derivatives, only automatically and at scale.
- **Gradient descent**: every weight update $w \leftarrow w - \alpha \nabla L$ is a step against the gradient of the loss function. Without derivatives, gradient-based optimization is impossible
- **Autograd**: PyTorch builds a computation graph and applies the chain rule to the known derivatives of each elementary operation; it never evaluates the limit from the definition
- **ReLU**: $\text{ReLU}(x) = \max(0, x)$ is not differentiable at $x = 0$ (a corner point, like $|x|$). In practice a subgradient is used: $f'(0) = 0$ or $f'(0) = 1$
- **Learning rate**: too large and the update overshoots the minimum; too small and convergence is slow. The derivative gives the direction, the learning rate controls the step size
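The update rule and the effect of the learning rate can be tried on a one-dimensional toy problem. The sketch below is only an illustration: the loss $L(w) = (w - 3)^2$, the starting point and the three learning rates are assumptions chosen for the example, not values from the lesson.

```python
def gradient_descent(grad, w0, lr, steps):
    """Repeat the update w <- w - lr * grad(w) and return every iterate."""
    history = [w0]
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
        history.append(w)
    return history


def loss_grad(w):
    """Derivative of the hypothetical toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


small = gradient_descent(loss_grad, w0=0.0, lr=0.01, steps=50)  # slow progress
good = gradient_descent(loss_grad, w0=0.0, lr=0.1, steps=50)    # converges
large = gradient_descent(loss_grad, w0=0.0, lr=1.1, steps=50)   # overshoots and diverges

for name, hist in [("lr=0.01", small), ("lr=0.1", good), ("lr=1.1", large)]:
    print(f"{name}: final w = {hist[-1]:.4f}, distance to minimum = {abs(hist[-1] - 3.0):.4g}")
```

For this particular loss each step multiplies the distance to the minimum by $1 - 2\alpha$, which is why the step size alone decides between slow convergence, fast convergence and divergence.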
Two geniuses, one dispute
**Isaac Newton** created his 'method of fluxions' around 1665 to solve problems in mechanics. **Gottfried Leibniz** independently developed calculus in the 1670s and published it in 1684 with more convenient notation ($\frac{dy}{dx}$). Their followers waged a years-long priority dispute that divided the mathematicians of England and the Continent. Everyone won: we use both notations today.
Definition of the derivative
The derivative of $f$ at $x$ is the **limit of the ratio of the function's increment to the argument's increment**:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
The fraction $\frac{f(x+h) - f(x)}{h}$ is the **difference quotient** - the average rate of change over $[x, x+h]$. Taking $h \to 0$ gives the **instantaneous** rate of change.
The derivative exists if this limit **exists and is finite**. Then the function is called **differentiable** at $x$. PyTorch autograd does not check this limit; it assumes each operation in the computation graph has a derivative formula and applies it.
What does the difference quotient $\frac{f(x+h) - f(x)}{h}$ represent?
The difference quotient is the average rate over an interval. The derivative - the instantaneous rate - is obtained as the limit when $h \to 0$.
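To see the average rate turn into the instantaneous rate, the quotient can be evaluated for shrinking $h$. This is a minimal sketch; the function $f(x) = x^2$, the point $x = 3$ and the step sizes are illustrative choices.

```python
def difference_quotient(f, x, h):
    """Average rate of change of f over [x, x + h]."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# For f(x) = x^2 at x = 3 the quotient equals 6 + h, so it should approach 6.
quotients = [(h, difference_quotient(square, 3.0, h)) for h in (1.0, 0.1, 0.01, 0.001)]
for h, q in quotients:
    print(f"h = {h:<6} quotient = {q:.6f}")
```

In floating-point arithmetic $h$ cannot be made arbitrarily small: below roughly $10^{-8}$ the subtraction $f(x+h) - f(x)$ loses most of its significant digits and the estimate gets worse rather than better.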
Geometric meaning
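The difference quotient is the **slope of the secant line** through $(x, f(x))$ and $(x+h, f(x+h))$. As $h \to 0$ the secant approaches the **tangent line**, whose slope is the derivative:
$$k_{\text{tangent}} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = f'(x)$$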
Equation of the tangent line
Through the point of tangency and the slope
The tangent to $f(x)$ at the point $x = a$:
$$y = f(a) + f'(a)(x - a)$$
**Example**: tangent to $y = x^2$ at $x = 2$:
- $f(2) = 4$, so the point of tangency is $(2, 4)$
- $f'(x) = 2x$, so the slope is $f'(2) = 4$

Equation: $y = 4 + 4(x - 2) = 4x - 4$

In ML: the tangent to the loss at the current weights is the linear approximation that gradient descent uses to take a step.
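The tangent-line formula translates directly into a small helper. The sketch below assumes the derivative is supplied by hand as a second function; the name `tangent_line` is hypothetical and introduced only for this illustration.

```python
def tangent_line(f, df, a):
    """Return (slope, intercept) of the tangent y = slope * x + intercept at x = a."""
    slope = df(a)
    intercept = f(a) - slope * a
    return slope, intercept


# Tangent to y = x^2 at x = 2, with the derivative 2x supplied by hand.
slope, intercept = tangent_line(lambda x: x**2, lambda x: 2 * x, 2.0)
print(f"slope = {slope}, intercept = {intercept}")

# Near the point of tangency the line is a good stand-in for the curve.
approx = slope * 2.1 + intercept
exact = 2.1**2
print(f"tangent at 2.1: {approx:.4f}, true value: {exact:.4f}")
```

The gap between the two printed values shrinks quadratically as the evaluation point moves toward $x = 2$, which is the sense in which the tangent is the best linear approximation.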
**Myth**: The tangent line intersects the graph at only one point
**Fact**: The tangent line may intersect the graph at other points too
The tangent is defined via the limit of secants, not by the number of intersections. For example, the tangent to $y = x^3$ at $x = 1$ is $y = 3x - 2$, which meets the graph again at $(-2, -8)$. A tangent can even cross the graph at the point of tangency: the tangent to $y = x^3$ at 0 is the $x$-axis.
Geometrically, what does $f'(a)$ represent?
The derivative is the limit of secant slopes as the second point approaches $(a, f(a))$ - the tangent slope.
Physical meaning
If $s(t)$ is the position of an object at time $t$, then:
- $v(t) = s'(t)$ - **instantaneous velocity**
- $a(t) = v'(t) = s''(t)$ - **acceleration** (second derivative)
Free fall
s(t) = 4.9t²
$s(t) = 4.9t^2$ meters.
- **Velocity**: $v(t) = s'(t) = 9.8t$ m/s
- **Acceleration**: $a(t) = v'(t) = 9.8$ m/s² (constant: this is $g$)

After 3 seconds: $s(3) = 44.1$ m, $v(3) = 29.4$ m/s.
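The free-fall numbers can be reproduced numerically without knowing any differentiation rules. The sketch below estimates velocity and acceleration with a symmetric difference quotient; the step `h = 1e-4` is an assumed value that works well for this smooth function.

```python
def position(t):
    """Distance fallen in meters after t seconds: s(t) = 4.9 t^2."""
    return 4.9 * t * t


def central_difference(f, t, h=1e-4):
    """Symmetric difference quotient, a numerical estimate of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)


def velocity(t):
    """Numerical estimate of s'(t)."""
    return central_difference(position, t)


def acceleration(t):
    """Numerical estimate of v'(t) = s''(t)."""
    return central_difference(velocity, t)


print(f"s(3) = {position(3.0):.2f} m")
print(f"v(3) = {velocity(3.0):.4f} m/s")
print(f"a(3) = {acceleration(3.0):.4f} m/s^2")
```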
If the velocity of a car is $v(t) = 20 + 2t$ m/s, what is the acceleration?
Acceleration is the derivative of velocity: $a = v'(t) = (20 + 2t)' = 2$ m/s². Speed increases by 2 m/s every second.
Computing from the definition
Derivative of x²
The classic first example
Find $(x^2)'$ from the definition:
$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}$$
Expand:
$$= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h}$$
Cancel $h$ (allowed because $h \neq 0$ inside the limit):
$$= \lim_{h \to 0} (2x + h) = 2x$$
So $(x^2)' = 2x$: the slope of the parabola grows linearly with $x$.
Derivative of 1/x
Negative power
For $f(x) = \frac{1}{x}$ with $x \neq 0$:
$$f'(x) = \lim_{h \to 0} \frac{\frac{1}{x+h} - \frac{1}{x}}{h} = \lim_{h \to 0} \frac{x - (x+h)}{h \cdot x(x+h)}$$
$$= \lim_{h \to 0} \frac{-1}{x(x+h)} = -\frac{1}{x^2}$$
So $(1/x)' = -1/x^2$. The derivative is negative for every $x \neq 0$, so the function decreases on each of the intervals $(-\infty, 0)$ and $(0, \infty)$.
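The algebraic step of cancelling $h$ in both derivations can be checked with exact rational arithmetic, so no rounding is involved. The sketch below uses Python's `fractions.Fraction`; the point $x = 3$ and the values of $h$ are arbitrary illustrative choices.

```python
from fractions import Fraction


def difference_quotient(f, x, h):
    """(f(x + h) - f(x)) / h, evaluated exactly when x and h are Fractions."""
    return (f(x + h) - f(x)) / h


def square(t):
    return t * t


def reciprocal(t):
    return 1 / t


x = Fraction(3)
for h in (Fraction(1, 10), Fraction(1, 1000), Fraction(-1, 1000)):
    q_square = difference_quotient(square, x, h)
    q_recip = difference_quotient(reciprocal, x, h)
    # After cancelling h, the quotients are 2x + h and -1 / (x (x + h)).
    assert q_square == 2 * x + h
    assert q_recip == -1 / (x * (x + h))
    print(f"h = {h}: x^2 quotient = {q_square}, 1/x quotient = {q_recip}")
```

Because the simplified expressions $2x + h$ and $-1/(x(x+h))$ are defined at $h = 0$, the limits are obtained by substituting $h = 0$ into them.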
When computing $(x^3)'$ from the definition, what expression needs to be taken as a limit?
By definition: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$. For $f(x) = x^3$ this is $\frac{(x+h)^3 - x^3}{h}$.
Table of derivatives
These derivatives must be **memorized** - they are the foundation of all differentiation:
| $f(x)$ | $f'(x)$ | Comment |
|---|---|---|
| $c$ (constant) | $0$ | A constant does not change |
| $x^n$ | $nx^{n-1}$ | Power rule |
| $e^x$ | $e^x$ | Equals its own derivative |
| $a^x$ | $a^x \ln a$ | Exponential function |
| $\ln x$ | $\frac{1}{x}$ | Natural logarithm |
| $\log_a x$ | $\frac{1}{x \ln a}$ | Logarithm |
| $\sin x$ | $\cos x$ | Shift by $\pi/2$ |
| $\cos x$ | $-\sin x$ | Minus! |
| $\tan x$ | $\frac{1}{\cos^2 x}$ | Or $1 + \tan^2 x$ |
**$e^x$ equals its own derivative**, and this is not a coincidence: up to a constant factor it is the only function with that property. That is why $e^x$ appears in solutions to differential equations such as $y' = y$, and it is one reason softmax is built on $e^x$.
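Each row of the table can be sanity-checked against a numerical derivative. The sketch below compares the tabulated formulas with a symmetric difference quotient at one test point; the point `x0 = 0.7`, the step size and the concrete instances ($n = 3$, $a = 2$, base 10) are assumptions made for the example.

```python
import math


def central_difference(f, x, h=1e-6):
    """Symmetric difference quotient, a numerical estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# (label, function, derivative taken from the table)
TABLE = [
    ("x^3", lambda x: x**3, lambda x: 3 * x**2),
    ("e^x", math.exp, math.exp),
    ("2^x", lambda x: 2.0**x, lambda x: 2.0**x * math.log(2.0)),
    ("ln x", math.log, lambda x: 1.0 / x),
    ("log_10 x", math.log10, lambda x: 1.0 / (x * math.log(10.0))),
    ("sin x", math.sin, math.cos),
    ("cos x", math.cos, lambda x: -math.sin(x)),
    ("tan x", math.tan, lambda x: 1.0 / math.cos(x) ** 2),
]

x0 = 0.7
errors = {name: abs(central_difference(f, x0) - df(x0)) for name, f, df in TABLE}
for name, err in errors.items():
    print(f"{name:<9} |numerical - table| = {err:.2e}")
```

A check like this does not prove the formulas, but a mismatch far above the rounding level is a quick way to catch a misremembered sign or factor.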
What is the derivative of $\ln(x)$?
Standard derivative: $\frac{d}{dx}\ln(x) = 1/x$ for $x > 0$.
Derivative notation
| Notation | Author | When used |
|---|---|---|
| $f'(x)$, $y'$ | Lagrange | Everywhere, convenient for functions |
| $\frac{df}{dx}$, $\frac{dy}{dx}$ | Leibniz | Emphasizes 'with respect to what' |
| $\dot{y}$, $\ddot{y}$ | Newton | Physics (derivative with respect to time) |
| $D_x f$, $Df$ | Euler | Operator form, functional analysis |
Leibniz notation $\frac{dy}{dx}$ became the ML standard: $\frac{\partial L}{\partial w}$ - the gradient of the loss with respect to the weights - is exactly Leibniz notation.
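The notation $\frac{\partial L}{\partial w_i}$ means: vary one weight, hold the others fixed, and take the derivative. A minimal numerical sketch of that idea follows; the two-weight loss is a hypothetical example, and real frameworks compute these values with automatic differentiation rather than finite differences.

```python
def numerical_gradient(loss, w, h=1e-6):
    """Estimate each partial derivative dL/dw_i with a central difference."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h   # nudge only the i-th weight up
        w_minus[i] -= h  # and down, keeping the others fixed
        grad.append((loss(w_plus) - loss(w_minus)) / (2 * h))
    return grad


def loss(w):
    """Hypothetical loss L(w1, w2) = w1^2 + 3 w1 w2."""
    return w[0] ** 2 + 3 * w[0] * w[1]


# By hand: dL/dw1 = 2 w1 + 3 w2 = 8 and dL/dw2 = 3 w1 = 3 at (1, 2).
grad = numerical_gradient(loss, [1.0, 2.0])
print(grad)
```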
Which notations all mean the derivative of $y = f(x)$?
All four notations (Lagrange, Leibniz, Newton, Euler) denote the same derivative. Each has historical and contextual usage.
Differentiability vs continuity
A function is **differentiable** at a point if the derivative exists there (the limit is finite). Key relationship:
$$f \text{ differentiable at } x \;\Longrightarrow\; f \text{ continuous at } x$$
**The converse is false.** A function can be continuous but not differentiable - that's exactly how ReLU behaves at 0.
ReLU and |x|
Continuous but not differentiable at zero
$f(x) = |x|$ is continuous everywhere. At $x = 0$:
- From the left: $\lim_{h \to 0^-} \frac{|h|}{h} = \lim_{h \to 0^-} \frac{-h}{h} = -1$
- From the right: $\lim_{h \to 0^+} \frac{|h|}{h} = \lim_{h \to 0^+} \frac{h}{h} = +1$

The one-sided limits **differ**, so there is no derivative: a corner point.

$\text{ReLU}(x) = \max(0, x)$ has the same problem at $x = 0$: the one-sided limits are $0$ and $1$. Any value in $[0, 1]$ is a valid subgradient there; PyTorch uses $0$ by convention. Training works in practice because a single point has measure zero, so a pre-activation almost never lands exactly on $0$.
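The mismatch between the left and right limits is easy to observe numerically. The sketch below computes both one-sided difference quotients at $0$ and adds a hand-written ReLU derivative with an explicit convention at the corner; `relu_grad` and its `at_zero` parameter are hypothetical names for this illustration, not a library API.

```python
def relu(x):
    return max(0.0, x)


def one_sided_quotients(f, x, h=1e-6):
    """Difference quotients of f at x, approached from the left and from the right."""
    left = (f(x - h) - f(x)) / (-h)
    right = (f(x + h) - f(x)) / h
    return left, right


def relu_grad(x, at_zero=0.0):
    """Derivative of ReLU away from 0; at 0 return a chosen subgradient in [0, 1]."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero


print("|x|  at 0:", one_sided_quotients(abs, 0.0))
print("ReLU at 0:", one_sided_quotients(relu, 0.0))
print("chosen ReLU'(0):", relu_grad(0.0))
```

Making the value at zero a parameter shows that it is a convention rather than a consequence of the definition: any choice between 0 and 1 gives a valid subgradient.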
**Myth**: If a function is continuous, it is differentiable
**Fact**: Continuity is necessary but not sufficient for differentiability
ReLU and $|x|$ have a corner at zero, and $\sqrt[3]{x}$ has a vertical tangent there: all three are continuous but not differentiable at zero. In 1872 Weierstrass constructed a function that is continuous everywhere and differentiable nowhere.
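The cube root fails to be differentiable at zero in a different way from $|x|$: its difference quotient $\frac{\sqrt[3]{h}}{h} = h^{-2/3}$ has no finite limit. A short sketch, with the helper name `cbrt` and the step sizes chosen for illustration:

```python
import math


def cbrt(x):
    """Real cube root, defined for negative x as well."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h


# At x = 0 the quotient equals h^(-2/3), which grows without bound as h -> 0.
steps = (1e-2, 1e-4, 1e-6)
quotients = [difference_quotient(cbrt, 0.0, h) for h in steps]
for h, q in zip(steps, quotients):
    print(f"h = {h:g}: quotient = {q:.1f}")
```

The function is continuous at zero (the numerator still tends to 0), but the slope of the secant grows without bound, which corresponds to a vertical tangent.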
Which statement is true?
$|x|$ is continuous at 0 but not differentiable there. Differentiability requires a unique tangent, a stronger condition than continuity.
Practice
Compute $(x^3)'$ from the definition of the derivative
$$f'(x) = \lim_{h \to 0} \frac{(x+h)^3 - x^3}{h}$$
Expand $(x+h)^3 = x^3 + 3x^2h + 3xh^2 + h^3$:
$$= \lim_{h \to 0} \frac{3x^2h + 3xh^2 + h^3}{h} = \lim_{h \to 0} (3x^2 + 3xh + h^2) = 3x^2$$
So $(x^3)' = 3x^2$, which confirms the formula $nx^{n-1}$.
Find the equation of the tangent line to $y = \sqrt{x}$ at $x = 4$
Point of tangency: $f(4) = 2$, so the point is $(4, 2)$.
Slope: by the power rule $f'(x) = \frac{1}{2\sqrt{x}}$, so $f'(4) = \frac{1}{2\sqrt{4}} = \frac{1}{4}$.
Tangent line equation:
$$y = 2 + \frac{1}{4}(x - 4) = \frac{1}{4}x + 1$$
Prove that $f(x) = x|x|$ is differentiable at $x = 0$
$$f(x) = \begin{cases} x^2 & x \geq 0 \\ -x^2 & x < 0 \end{cases}$$
Since $f(0) = 0$, the difference quotient at $0$ is $\frac{f(h)}{h}$.
**From the right**: $\lim_{h \to 0^+} \frac{h^2}{h} = \lim_{h \to 0^+} h = 0$
**From the left**: $\lim_{h \to 0^-} \frac{-h^2}{h} = \lim_{h \to 0^-} (-h) = 0$
The limits **coincide**: $f'(0) = 0$, so the function is differentiable. Unlike $|x|$, the function $x|x|$ 'smooths out' the corner. Analogy: $\text{SiLU}(x) = x \cdot \sigma(x)$, a smooth product.
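A numerical companion to this proof: both one-sided quotients of $x|x|$ at $0$ shrink toward the same value as $h$ decreases. This is an illustration of the result, not a substitute for the limit argument; the step sizes are arbitrary.

```python
def x_abs_x(x):
    return x * abs(x)


def one_sided_quotients(f, x, h):
    """Difference quotients of f at x, approached from the left and from the right."""
    left = (f(x - h) - f(x)) / (-h)
    right = (f(x + h) - f(x)) / h
    return left, right


results = []
for h in (1e-1, 1e-3, 1e-5):
    left, right = one_sided_quotients(x_abs_x, 0.0, h)
    results.append((h, left, right))
    print(f"h = {h:g}: left = {left:.6f}, right = {right:.6f}")
```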
Compute $f'(2)$ for $f(x) = x^3$.
$f'(x) = 3x^2$, so $f'(2) = 3 \cdot 4 = 12$.
Connection with other topics
The derivative is the central concept of calculus
- Differentiation rules: the next lesson, how to compute quickly without returning to the definition
- Continuity: differentiability implies continuity, but ReLU shows the converse fails
- Integral: integration is the inverse operation of differentiation (Fundamental Theorem of Calculus)
- Optimization: the derivative is 0 at interior extrema of a differentiable function; gradient descent searches for such points
Summary
- Derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ - instantaneous rate of change
- Geometrically: slope of the tangent line. In ML: the negative gradient is the direction of a gradient descent step
- Physically: velocity (first derivative), acceleration (second)
- Differentiability implies continuity (but not conversely - ReLU is the example)
- Table: $(x^n)' = nx^{n-1}$, $(e^x)' = e^x$, $(\sin x)' = \cos x$
Questions for reflection
- Why is $e^x$ the only function (up to a constant factor) equal to its own derivative? How does this connect to solving the ODE $y' = y$?
- ReLU is not differentiable at zero, yet PyTorch trains neural networks with ReLU. Why does this work?
- What would happen to gradient descent if the loss function had a discontinuity?