Calculus
Differentiation Rules
Lesson objectives
- Apply the linearity rule (sum, constant)
- Use the product rule $(fg)' = f'g + fg'$
- Use the quotient rule $(f/g)' = (f'g - fg')/g^2$
- Compute higher-order derivatives
- Understand the connection between differentiation rules and backpropagation
Prerequisites
- The concept of the derivative and its definition
- Table of derivatives of elementary functions
In 1684 Leibniz published the product rule $(uv)' = u'v + uv'$ in 'Nova Methodus', deriving it from infinitely small increments. Today every backward pass in PyTorch applies this rule at each multiplication in the network's computation graph. Three rules - linearity, product, quotient - make differentiation mechanical and fast.
- **Backpropagation**: the gradient of the loss is computed through differentiation rules. Attention scores in transformers are a product $QK^T / \sqrt{d}$, so the product rule applies when differentiating with respect to $Q$ and $K$
- **JAX and automatic differentiation**: `jax.grad` decomposes a computation into primitive operations and applies a derivative rule to each one - the same rules this lesson applies by hand, symbolically
- **Second derivative**: the sign of $f''$ determines convexity. Adam's second moment is a running average of squared gradients, not $f''$ itself; it serves as a cheap per-parameter scale for the adaptive step size
- **Optimization**: derivative = 0 at an interior extremum of a differentiable function (necessary condition). At such a point, second derivative > 0 guarantees a local minimum - Newton's method uses both derivatives
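The automatic-differentiation bullet above can be illustrated with a minimal sketch in plain Python. It uses forward-mode dual numbers, which is an assumption made for brevity: `jax.grad` itself works in reverse mode, and the `Dual` class and `derivative` helper are hypothetical names, not part of any library. Each operator carries a value together with its derivative and applies one rule from this lesson.

```python
class Dual:
    """A value paired with its derivative (hypothetical helper, not a library type)."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants: their derivative is 0.
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        # Linearity: (f + g)' = f' + g'
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        # Linearity: (f - g)' = f' - g'
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        # Quotient rule: (f/g)' = (f'g - fg') / g^2
        return Dual(self.val / other.val,
                    (self.der * other.val - self.val * other.der) / other.val ** 2)

    def __rtruediv__(self, other):
        return Dual.lift(other) / self


def derivative(func, x):
    """Evaluate func'(x) by seeding dx/dx = 1."""
    return func(Dual(x, 1.0)).der


# (3x^2 + 5x - 2)' = 6x + 5, evaluated at x = 2
print(derivative(lambda x: 3 * x * x + 5 * x - 2, 2.0))
# (x / (x^2 + 1))' = (1 - x^2) / (x^2 + 1)^2, evaluated at x = 1
print(derivative(lambda x: x / (x * x + 1), 1.0))
```

No symbolic formula is ever built: the rules are applied to numbers, one operation at a time, which is the sense in which differentiation becomes mechanical.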
Leibniz's rule
The product rule and the notation $\frac{dy}{dx}$ were introduced by **Leibniz** in his 'Nova Methodus' (1684). The notation proved so elegant it is still used today. Interestingly, Leibniz derived $(uv)' = u'v + uv'$ as a consequence of infinitely small increments - a concept later formalized by Cauchy using limits.
Linearity rule
The derivative is a **linear operation**. This means: $$(cf)' = c f', \qquad (f \pm g)' = f' \pm g'$$
A constant **factors out** of the derivative. A sum is differentiated **term by term**. Linearity makes the gradient of a sum of losses equal to the sum of gradients - a key property for mini-batch training.
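The mini-batch claim can be checked numerically. The sketch below assumes a hypothetical one-parameter model $y = w \cdot \text{input}$ with squared-error losses and a central-difference helper `numeric_derivative`; the data and names are illustrations, not part of the lesson.

```python
def numeric_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Hypothetical mini-batch: (input, target) pairs for the model y = w * input.
batch = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5)]


def sample_loss(w, sample):
    inp, target = sample
    return (w * inp - target) ** 2


def batch_loss(w):
    return sum(sample_loss(w, s) for s in batch)


w0 = 0.7
# Derivative of the summed loss ...
grad_of_sum = numeric_derivative(batch_loss, w0)
# ... versus the sum of the per-sample derivatives.
sum_of_grads = sum(numeric_derivative(lambda w: sample_loss(w, s), w0) for s in batch)

print(grad_of_sum, sum_of_grads)
```

The two numbers agree up to floating-point rounding, so per-sample gradients can be computed independently and added.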
Polynomial
Linearity in action
$(3x^2 + 5x - 2)'$ Apply linearity: $$= 3(x^2)' + 5(x)' - (2)'$$ $$= 3 \cdot 2x + 5 \cdot 1 - 0$$ $$= 6x + 5$$
What is $(7x^3 - 2x + 4)'$?
$(7x^3)' = 7 \cdot 3x^2 = 21x^2$, $(-2x)' = -2$, $(4)' = 0$. Total: $21x^2 - 2$.
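Term-by-term differentiation of a polynomial is mechanical enough to fit in one line. A minimal sketch, assuming the polynomial is stored as a coefficient list with the coefficient of $x^k$ at index $k$ (the function name `poly_derivative` is hypothetical):

```python
def poly_derivative(coeffs):
    """Differentiate a polynomial given as [a0, a1, a2, ...] (coefficient of x^k at index k)."""
    # Linearity lets each term a_k * x^k be handled on its own: it becomes k * a_k * x^(k-1).
    return [k * a for k, a in enumerate(coeffs)][1:]


# 3x^2 + 5x - 2  ->  6x + 5
print(poly_derivative([-2, 5, 3]))
# 7x^3 - 2x + 4  ->  21x^2 - 2
print(poly_derivative([4, -2, 0, 7]))
```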
Product rule
The derivative of a product is **NOT** the product of the derivatives! Leibniz's rule: $$(fg)' = f'g + fg'$$
**Mnemonic**: 'derivative of the first times the second, plus the first times the derivative of the second'. Or simply: $u'v + uv'$. In backprop this means: when differentiating a product of two activations, each one 'takes the prime' in turn.
x² · sin(x)
Product rule at work
$(x^2 \sin x)'$ $f = x^2$, $g = \sin x$: $$= (x^2)' \cdot \sin x + x^2 \cdot (\sin x)'$$ $$= 2x \sin x + x^2 \cos x$$ $$= x(2 \sin x + x \cos x)$$
Three factors
Generalization of the rule
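$(xyz)' = ?$ Here $x$, $y$, $z$ are three functions of the same variable. Group the first two factors and apply the rule twice: $$(xy \cdot z)' = (xy)' \cdot z + xy \cdot z' = (x'y + xy') \cdot z + xyz' = \boxed{x'yz + xy'z + xyz'}$$ **Pattern**: each factor takes the prime in turn, so a product of $n$ factors gives $n$ terms. The same bookkeeping appears in backprop wherever several quantities are multiplied together.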
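The "each factor takes the prime in turn" pattern can be written as a loop. A minimal sketch, assuming each factor is supplied together with its known derivative; the helper `product_derivative` and the three example factors are illustrative choices, and the result is compared against a central difference.

```python
import math


def product_derivative(factors, x):
    """Derivative of a product of functions at x.

    `factors` is a list of (f, f_prime) pairs; each factor takes the prime in turn.
    """
    total = 0.0
    for i, (_, f_prime) in enumerate(factors):
        term = f_prime(x)
        for j, (f, _) in enumerate(factors):
            if j != i:
                term *= f(x)
        total += term
    return total


factors = [
    (lambda x: x ** 2, lambda x: 2 * x),  # x^2 and its derivative
    (math.sin, math.cos),                 # sin x and its derivative
    (math.exp, math.exp),                 # e^x and its derivative
]

x0 = 0.8
h = 1e-6
product = lambda x: x ** 2 * math.sin(x) * math.exp(x)
numeric = (product(x0 + h) - product(x0 - h)) / (2 * h)
print(product_derivative(factors, x0), numeric)
```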
**Common mistake**: $(fg)' = f' \cdot g'$ - the derivative of a product equals the product of the derivatives
**Correct**: $(fg)' = f'g + fg'$ - Leibniz's rule must be used
Check: $(x \cdot x)' = (x^2)' = 2x$, but $x' \cdot x' = 1 \cdot 1 = 1 \neq 2x$. The two expressions agree only at $x = 1/2$.
What is $(e^x \cdot \ln x)'$?
$(e^x)' = e^x$, $(\ln x)' = \frac{1}{x}$. By the product rule: $e^x \cdot \ln x + e^x \cdot \frac{1}{x} = e^x(\ln x + \frac{1}{x})$.
Quotient rule
For a fraction the formula is more involved - it follows from the product rule: $$\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$$
**Mnemonic**: 'lo d-hi minus hi d-lo, all over lo squared'. The minus sign and $g^2$ in the denominator - the main things not to forget.
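The claim that the quotient rule follows from the product rule can be made explicit. A short derivation, assuming $g(x) \neq 0$ and that the quotient is differentiable; the name $q = f/g$ is introduced here only for this derivation:

$$f = q \cdot g \;\Rightarrow\; f' = q'g + qg'$$
$$q' = \frac{f' - qg'}{g} = \frac{f' - \frac{f}{g}\,g'}{g} = \frac{f'g - fg'}{g^2}$$

The minus sign comes from moving $qg'$ to the other side, and $g^2$ appears when the inner fraction $f/g$ is cleared.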
x / (x² + 1)
A fraction similar to softmax-like functions
$\left(\frac{x}{x^2 + 1}\right)'$ with $f = x$, $f' = 1$; $g = x^2 + 1$, $g' = 2x$: $$= \frac{1 \cdot (x^2+1) - x \cdot 2x}{(x^2+1)^2} = \frac{x^2 + 1 - 2x^2}{(x^2+1)^2} = \frac{1 - x^2}{(x^2+1)^2}$$ At $x = \pm 1$ the derivative is 0: the function has a local maximum at $x = 1$ and a local minimum at $x = -1$.
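The result of this example can be spot-checked against a finite difference. A minimal sketch; the helper `numeric_derivative` and the sample points are arbitrary choices for illustration.

```python
def f(x):
    return x / (x ** 2 + 1)


def f_prime(x):
    # Closed form obtained from the quotient rule in the example.
    return (1 - x ** 2) / (x ** 2 + 1) ** 2


def numeric_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (-2.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(x, f_prime(x), numeric_derivative(f, x))
```

The derivative is positive between $-1$ and $1$ and negative outside that interval, which is what places the maximum at $x = 1$ and the minimum at $x = -1$.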
What is $\left(\frac{\sin x}{x}\right)'$?
By the quotient rule: $\frac{(\sin x)' \cdot x - \sin x \cdot (x)'}{x^2} = \frac{x\cos x - \sin x}{x^2}$.
Complete table of derivatives
Power functions
| $f(x)$ | $f'(x)$ | Example |
|---|---|---|
| $x^n$ (any $n$) | $nx^{n-1}$ | $(x^5)' = 5x^4$ |
| $\sqrt{x} = x^{1/2}$ | $\frac{1}{2\sqrt{x}}$ | $(\sqrt{x})' = \frac{1}{2\sqrt{x}}$ |
| $\frac{1}{x} = x^{-1}$ | $-\frac{1}{x^2}$ | $(1/x)' = -1/x^2$ |
| $\sqrt[n]{x} = x^{1/n}$ | $\frac{1}{n}x^{1/n - 1}$ | $(\sqrt[3]{x})' = \frac{1}{3x^{2/3}}$ |
Trigonometric functions
| $f(x)$ | $f'(x)$ |
|---|---|
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |
| $\tan x$ | $\frac{1}{\cos^2 x} = \sec^2 x$ |
| $\cot x$ | $-\frac{1}{\sin^2 x} = -\csc^2 x$ |
Exponential and logarithmic
| $f(x)$ | $f'(x)$ |
|---|---|
| $e^x$ | $e^x$ |
| $a^x$ | $a^x \ln a$ |
| $\ln x$ | $\frac{1}{x}$ |
| $\log_a x$ | $\frac{1}{x \ln a}$ |
$e^x$ is the **only** function (up to a constant factor) equal to its own derivative: $f' = f$ forces $f = Ce^x$. Softmax, cross-entropy, the normal distribution - $e^x$ is everywhere.
What is $\frac{d}{dx}[\tan(x)]$?
Standard derivative: by the quotient rule, $(\sin x/\cos x)' = (\cos^2 x + \sin^2 x)/\cos^2 x = 1/\cos^2 x = \sec^2 x$.
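The table entries can be spot-checked numerically. A minimal sketch that compares each tabulated derivative with a central difference at a single point; the point $x_0 = 0.7$, the base $a = 2$ and the step `h` are arbitrary assumptions for the check.

```python
import math

# (name, f, f' from the table); a = 2 is an arbitrary base chosen for the check.
a = 2.0
table = [
    ("x^5",     lambda x: x ** 5,          lambda x: 5 * x ** 4),
    ("sqrt x",  math.sqrt,                 lambda x: 1 / (2 * math.sqrt(x))),
    ("1/x",     lambda x: 1 / x,           lambda x: -1 / x ** 2),
    ("cbrt x",  lambda x: x ** (1 / 3),    lambda x: 1 / (3 * x ** (2 / 3))),
    ("sin x",   math.sin,                  math.cos),
    ("cos x",   math.cos,                  lambda x: -math.sin(x)),
    ("tan x",   math.tan,                  lambda x: 1 / math.cos(x) ** 2),
    ("cot x",   lambda x: 1 / math.tan(x), lambda x: -1 / math.sin(x) ** 2),
    ("e^x",     math.exp,                  math.exp),
    ("a^x",     lambda x: a ** x,          lambda x: a ** x * math.log(a)),
    ("ln x",    math.log,                  lambda x: 1 / x),
    ("log_a x", lambda x: math.log(x, a),  lambda x: 1 / (x * math.log(a))),
]

x0, h = 0.7, 1e-6
errors = {}
for name, f, f_prime in table:
    numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
    errors[name] = abs(numeric - f_prime(x0))
    print(f"{name:8s} table={f_prime(x0):+.6f} numeric={numeric:+.6f}")
```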
Higher-order derivatives
The derivative can be taken **repeatedly**: $$f'' = (f')', \qquad f^{(n)} = \left(f^{(n-1)}\right)'$$ The second derivative determines convexity and is used in Newton's method for optimization.
Higher derivatives of x⁴
Degree decreases to zero
$f(x) = x^4$, $f'(x) = 4x^3$, $f''(x) = 12x^2$, $f'''(x) = 24x$, $f^{(4)}(x) = 24$, $f^{(5)}(x) = 0$ - and all subsequent ones too! **Pattern**: for a polynomial of degree $n$, the derivative of order $n+1$ is identically zero. Taylor approximation runs this in reverse: it builds a polynomial whose derivatives at a point match those of a sufficiently smooth function.
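The chain of derivatives of $x^4$ can be reproduced by applying the power rule repeatedly to a coefficient list. A minimal sketch, assuming the coefficient of $x^k$ is stored at index $k$ and that an empty list stands for the zero polynomial (the name `poly_derivative` is hypothetical):

```python
def poly_derivative(coeffs):
    """Derivative of a polynomial stored as [a0, a1, a2, ...]."""
    return [k * a for k, a in enumerate(coeffs)][1:]


coeffs = [0, 0, 0, 0, 1]  # x^4
history = []
for order in range(1, 6):
    coeffs = poly_derivative(coeffs)
    history.append(coeffs)
    print(order, coeffs)
```

Each step shortens the list by one, so a degree-$n$ polynomial is exhausted after $n + 1$ derivatives.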
Derivatives of sin(x)
Cycle of length 4
$\sin x \to \cos x \to -\sin x \to -\cos x \to \sin x \to \ldots$ **Formula**: $(\sin x)^{(n)} = \sin(x + n \cdot \frac{\pi}{2})$ Fourier analysis builds on this: $\sin$ and $\cos$ are eigenfunctions of the second-derivative operator, since $(\sin x)'' = -\sin x$, and the complex exponentials $e^{ikx}$ are eigenfunctions of differentiation itself.
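The shift formula and the four-step cycle describe the same sequence, which can be confirmed numerically. A minimal sketch; the evaluation point and the range of orders are arbitrary assumptions.

```python
import math

# The four-step cycle of derivatives of sin, in order.
cycle = [math.sin, math.cos, lambda x: -math.sin(x), lambda x: -math.cos(x)]


def sin_nth_derivative(n, x):
    """n-th derivative of sin via the shift formula sin(x + n*pi/2)."""
    return math.sin(x + n * math.pi / 2)


x0 = 0.3
max_gap = max(abs(sin_nth_derivative(n, x0) - cycle[n % 4](x0)) for n in range(12))
print(max_gap)
```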
**Common mistake**: $(f^2)' = (f')^2$ - the square of the derivative
**Correct**: $(f^2)' = 2f \cdot f'$ - product rule (or chain rule)
Check with $f(x) = x^2$: $f^2 = x^4$, so $(f^2)' = 4x^3$, whereas $(f')^2 = (2x)^2 = 4x^2 \neq 4x^3$. The correct rule gives $2f \cdot f' = 2x^2 \cdot 2x = 4x^3$.
If $f(x) = x^4$, what is $f^{(3)}(x)$?
$f' = 4x^3$, $f'' = 12x^2$, $f''' = 24x$.
Practice
Find the derivative of $f(x) = x^3 \cdot e^x$
$u = x^3$, $u' = 3x^2$; $v = e^x$, $v' = e^x$. $(x^3 \cdot e^x)' = 3x^2 \cdot e^x + x^3 \cdot e^x = e^x(3x^2 + x^3) = e^x \cdot x^2(3 + x)$
Find the derivative of $f(x) = \frac{x^2 - 1}{x^2 + 1}$
$f = x^2 - 1$, $f' = 2x$; $g = x^2 + 1$, $g' = 2x$. $$= \frac{2x(x^2+1) - (x^2-1) \cdot 2x}{(x^2+1)^2} = \frac{2x[(x^2+1) - (x^2-1)]}{(x^2+1)^2} = \frac{4x}{(x^2+1)^2}$$
Find $f^{(10)}(0)$ for $f(x) = e^x \cos x$
$e^x \cos x = \text{Re}(e^{(1+i)x})$. Derivative: $(e^{(1+i)x})^{(10)} = (1+i)^{10} e^{(1+i)x}$. $(1+i)^{10} = (\sqrt{2})^{10} e^{i \cdot 10\pi/4} = 32 e^{i5\pi/2} = 32 e^{i\pi/2} = 32i$. At $x=0$: $\text{Re}(32i) = 0$. **Answer**: $f^{(10)}(0) = 0$
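The complex-exponential shortcut can be cross-checked using only the product rule. Every derivative of $e^x \cos x$ has the form $e^x(a\cos x + b\sin x)$, and one application of the product rule maps the coefficients $(a, b)$ to $(a + b,\; b - a)$. A minimal sketch of that recurrence; the function name `derivative_coefficients` is hypothetical.

```python
def derivative_coefficients(order):
    """Return (a, b) such that the order-th derivative of e^x cos x is e^x (a cos x + b sin x)."""
    a, b = 1, 0  # order 0: e^x cos x itself
    for _ in range(order):
        # Product rule on e^x * (a cos x + b sin x):
        # e^x (a cos x + b sin x) + e^x (-a sin x + b cos x)
        a, b = a + b, b - a
    return a, b


a, b = derivative_coefficients(10)
print(a, b)  # coefficients of cos x and sin x in the 10th derivative

# At x = 0: e^0 = 1, cos 0 = 1, sin 0 = 0, so f^(10)(0) equals a.
print(a)

# The complex route used in the solution: (1 + i)^10
print((1 + 1j) ** 10)
```

The recurrence gives $f^{(10)}(x) = -32\, e^x \sin x$, which matches $\text{Re}(32i \cdot e^{(1+i)x})$ and vanishes at $x = 0$.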
Use the product rule on $f(x) = x^2 \sin(x)$. What is $f'(x)$?
Product rule: $(uv)' = u'v + uv'$ with $u = x^2$, $v = \sin x$, so $f'(x) = 2x \sin x + x^2 \cos x$.
Connection with other topics
Differentiation rules are the foundation of computational calculus
- Chain rule — Next lesson: derivative of a composition $(g(f(x)))' = g'(f(x)) \cdot f'(x)$ - the key to backprop
- Taylor series — Use higher-order derivatives to expand into a polynomial - approximating softmax, GELU
- Optimization — The second derivative determines convexity. Newton's method uses $f''$ for fast convergence
- Derivative — The definition of the derivative - previous lesson
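The Optimization entry can be made concrete with a minimal one-dimensional sketch of Newton's method for minimization, $x \leftarrow x - f'(x)/f''(x)$. The test function $f(x) = e^x - 2x$ and the helper name `newton_minimize` are assumptions chosen for illustration; since $f'' = e^x > 0$ everywhere, the critical point $x = \ln 2$ is the minimum.

```python
import math


def newton_minimize(f_prime, f_second, x, steps=20, tol=1e-12):
    """Newton iteration on f'(x) = 0; assumes f''(x) > 0 near the solution."""
    for _ in range(steps):
        step = f_prime(x) / f_second(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# f(x) = e^x - 2x, f'(x) = e^x - 2, f''(x) = e^x
x_min = newton_minimize(lambda x: math.exp(x) - 2, math.exp, x=0.0)
print(x_min, math.log(2))
```

Both derivatives come from the table and the linearity rule; no other differentiation machinery is needed.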
Summary
- **Linearity**: $(cf)' = cf'$, $(f \pm g)' = f' \pm g'$ - gradient of a sum of losses = sum of gradients
- **Product**: $(fg)' = f'g + fg'$ - NOT $f' \cdot g'$!
- **Quotient**: $(f/g)' = (f'g - fg')/g^2$
- **Power**: $(x^n)' = nx^{n-1}$ for any $n$, including fractional and negative
- **Higher derivatives**: $f^{(n)}$ - the second determines convexity, the $n$-th appears in the Taylor series
Questions for reflection
- Why is $(fg)' \neq f' \cdot g'$? What simple example demonstrates this?
- How does mini-batch training exploit the linearity of the derivative?
- Why does second derivative > 0 at a critical point indicate a local minimum rather than a maximum?