Calculus
Derivative of a Composite Function
Lesson Objectives
- Understand and apply the chain rule
- Differentiate multi-layer compositions
- Compute derivatives of inverse functions
- Master logarithmic differentiation
- Know the derivatives of inverse trigonometric functions
Prerequisites
- Differentiation rules
- The concept of function composition
- Properties of logarithms
In 1974, Paul Werbos applied the chain rule to multilayer network training in his Harvard PhD thesis. The idea drew little attention until 1986, when Rumelhart, Hinton and Williams published "Learning representations by back-propagating errors" in Nature. Since then, models from GPT to AlphaFold have been trained by repeated applications of the chain rule across millions to billions of parameters. PyTorch autograd (2017) and JAX (2018) automate exactly this operation.
- **PyTorch autograd**: torch.Tensor.backward() is the applied chain rule. Each op remembers its local gradient, then the chain multiplies from loss back to parameters
- **GPT-3 (175B parameters)**: one training step applies the chain rule through a computation graph that touches all 175 billion parameters. Without backprop, training at this scale would be impractical
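- **JAX grad/vjp/jvp**: forward-mode and reverse-mode autodiff are both forms of the chain rule. vjp (vector-Jacobian product, reverse mode) powers backprop; jvp (Jacobian-vector product, forward mode) pushes a tangent vector forward through the computation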
- **Vanishing gradients**: a long chain of factors smaller than one in magnitude shrinks toward zero. This is why LSTMs (1997), residual connections in ResNet (2015), and activation normalisation exist
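The vanishing-gradient effect can be reproduced in a few lines. The sketch below is a toy model, not a real network: it assumes a chain of scalar sigmoid layers with all weights equal to 1, and the function name `gradient_through_sigmoid_chain` is made up for this illustration. Each local derivative is at most 0.25, so the chain-rule product shrinks at least geometrically with depth.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    # Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_through_sigmoid_chain(x0: float, depth: int) -> float:
    """Derivative of sigmoid(sigmoid(...sigmoid(x0))) with respect to x0.

    Toy assumption: every layer is a plain sigmoid with weight 1.
    The chain rule multiplies the local derivative of each layer.
    """
    x = x0
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_grad(x)  # local derivative at this layer's input
        x = sigmoid(x)           # forward to the next layer
    return grad


for depth in (1, 5, 20, 50):
    print(depth, gradient_through_sigmoid_chain(0.0, depth))
```

The printed values drop by many orders of magnitude as the depth grows, which is the collapse the bullet above describes.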
The Foundation of Modern ML
The chain rule has been known since the 17th century, but its significance grew in the era of machine learning. In 1974 **Paul Werbos** described how to apply the chain rule to train neural networks - this became the basis of **backpropagation**. In 1986 Rumelhart, Hinton and Williams popularized the method, and today repeated application of the chain rule lies at the core of every neural network training run.
The Chain Rule
For the **composition** $f(g(x))$ - a "function of a function" - the derivative is computed as follows:
$$(f(g(x)))' = f'(g(x)) \cdot g'(x)$$
**Algorithm**:
- Identify the **outer** function $f$ and the **inner** function $g$
- Take the derivative of the outer function, leaving the inner function unchanged
- Multiply by the derivative of the inner function
**Mnemonic**: "derivative of outside × derivative of inside".
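The three steps above can be sketched in Python and checked against a finite-difference estimate. The helper names `chain_rule` and `central_difference` and the test function $\ln(1 + x^2)$ are chosen for this illustration only; they do not come from the lesson.

```python
import math


def chain_rule(outer_prime, inner, inner_prime, x):
    """Return f'(g(x)) * g'(x): outer derivative at the inner value, times inner derivative."""
    return outer_prime(inner(x)) * inner_prime(x)


def central_difference(func, x, h=1e-6):
    """Numerical estimate of func'(x), used only as an independent check."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Illustrative composition: ln(1 + x^2)
# outer f(u) = ln u  with f'(u) = 1/u
# inner g(x) = 1 + x^2 with g'(x) = 2x
def inner(x):
    return 1 + x ** 2


def inner_prime(x):
    return 2 * x


def outer_prime(u):
    return 1 / u


def composite(x):
    return math.log(inner(x))


x = 1.2
analytic = chain_rule(outer_prime, inner, inner_prime, x)
numeric = central_difference(composite, x)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places; a large gap usually means the inner derivative was forgotten.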
For the function $\sin(x^3)$: which is the outer function and which is the inner?
First we compute $x^3$ (inner), then we take $\sin$ of the result (outer). Sine is the outermost layer.
In Leibniz Notation
Let $y = f(u)$ and $u = g(x)$. Then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
It looks as though $du$ "cancels out"! This is not a genuine cancellation, but the mnemonic works.
Leibniz's notation shows the **path** of the derivative: from $y$ to $u$, from $u$ to $x$. Hence the name - the **chain** rule.
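A short sketch of why the mnemonic works even though nothing literally cancels. It assumes $\Delta u \neq 0$ for all sufficiently small $\Delta x \neq 0$; a full proof treats the case $\Delta u = 0$ separately. For finite increments the fraction identity is genuine, and the chain rule is its limit:

$$\frac{\Delta y}{\Delta x} = \frac{\Delta y}{\Delta u} \cdot \frac{\Delta u}{\Delta x}$$

$$\frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta u} \cdot \lim_{\Delta x \to 0} \frac{\Delta u}{\Delta x} = \frac{dy}{du} \cdot \frac{du}{dx}$$

The first limit equals $dy/du$ because $u$ is continuous, so $\Delta u \to 0$ as $\Delta x \to 0$.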
In Leibniz notation, the chain rule for $y = y(u)$, $u = u(x)$ is written as:
The chain of dependencies $x \to u \to y$ multiplies rates: $dy/dx = (dy/du)(du/dx)$.
Examples
Power of an Expression
$(x^2 + 1)^3$
$f(x) = (x^2 + 1)^3$
Outer: $u^3$, its derivative $3u^2$
Inner: $u = x^2 + 1$, its derivative $2x$
$$f'(x) = 3(x^2 + 1)^2 \cdot 2x = \boxed{6x(x^2 + 1)^2}$$
Trigonometry of an Expression
$\sin(x^2)$
$f(x) = \sin(x^2)$
Outer: $\sin u$, its derivative $\cos u$
Inner: $u = x^2$, its derivative $2x$
$$f'(x) = \cos(x^2) \cdot 2x = \boxed{2x\cos(x^2)}$$
Incorrect: $(\sin(x^2))' = \cos(x^2)$ - forgot the inner derivative
Correct: $(\sin(x^2))' = 2x\cos(x^2)$ - we must multiply by the derivative of the argument
The derivative of $\sin$ gives $\cos$, but the argument is not simply $x$, it is $x^2$. We still need to multiply by $(x^2)' = 2x$!
Exponential of a Linear Expression
$e^{3x}$
$f(x) = e^{3x}$
Outer: $e^u$, its derivative $e^u$
Inner: $u = 3x$, its derivative $3$
$$f'(x) = e^{3x} \cdot 3 = \boxed{3e^{3x}}$$
**Generalization**: $(e^{ax})' = ae^{ax}$
What is $(\cos(5x))'$?
The outer $\cos u$ gives $-\sin u$. The inner $5x$ gives $5$. Result: $-\sin(5x) \cdot 5 = -5\sin(5x)$.
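The four results in this section can be sanity-checked numerically. This is a minimal sketch using a central-difference estimate; the sample point `x = 0.7` and the helper name `central_difference` are arbitrary choices for the illustration.

```python
import math


def central_difference(func, x, h=1e-6):
    """Numerical estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# name -> (function, derivative obtained with the chain rule in the examples)
examples = {
    "(x^2 + 1)^3": (lambda x: (x ** 2 + 1) ** 3, lambda x: 6 * x * (x ** 2 + 1) ** 2),
    "sin(x^2)": (lambda x: math.sin(x ** 2), lambda x: 2 * x * math.cos(x ** 2)),
    "exp(3x)": (lambda x: math.exp(3 * x), lambda x: 3 * math.exp(3 * x)),
    "cos(5x)": (lambda x: math.cos(5 * x), lambda x: -5 * math.sin(5 * x)),
}

x = 0.7  # arbitrary sample point
errors = {}
for name, (func, derivative) in examples.items():
    errors[name] = abs(derivative(x) - central_difference(func, x))
    print(f"{name:12s} |analytic - numeric| = {errors[name]:.2e}")
```

Each reported difference should be tiny compared with the derivative itself; dropping a factor such as `2 * x` makes the mismatch obvious.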
Multi-Layer Compositions
For a triple composition $f(g(h(x)))$ the rule is applied **in sequence**:
$$(f(g(h(x))))' = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$
Triple Composition
$\sin(\cos(x^2))$
$f(x) = \sin(\cos(x^2))$
Three layers: $\sin$, $\cos$, $x^2$
$$f'(x) = \cos(\cos(x^2)) \cdot (-\sin(x^2)) \cdot 2x$$
$$= \boxed{-2x \sin(x^2) \cos(\cos(x^2))}$$
**Order**: work from the outermost to the innermost, multiplying the derivatives.
In neural network **backpropagation** exactly this happens: the gradient "flows" through the chain, being multiplied by the local derivative of each layer.
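The gradient flow can be sketched for the triple composition $\sin(\cos(x^2))$. This is a scalar toy version, not how any particular framework is implemented; the names `layers`, `forward` and `backward` are illustrative. The forward pass records the input of each layer, and the backward pass multiplies local derivatives from the outermost layer back to $x$.

```python
import math

# Each layer is a pair (function, derivative), listed from innermost to outermost.
layers = [
    (lambda x: x ** 2, lambda x: 2 * x),
    (math.cos, lambda u: -math.sin(u)),
    (math.sin, math.cos),
]


def forward(layers, x):
    """Evaluate the composition and remember the input seen by every layer."""
    inputs = []
    value = x
    for func, _ in layers:
        inputs.append(value)
        value = func(value)
    return value, inputs


def backward(layers, inputs):
    """Multiply local derivatives, starting at the output and moving toward x."""
    grad = 1.0
    for (_, derivative), layer_input in zip(reversed(layers), reversed(inputs)):
        grad *= derivative(layer_input)
    return grad


x = 0.9  # arbitrary sample point
value, inputs = forward(layers, x)
grad = backward(layers, inputs)

# Closed form from the worked example, for comparison.
closed_form = -2 * x * math.sin(x ** 2) * math.cos(math.cos(x ** 2))
print(grad, closed_form)
```

Storing the intermediate inputs during the forward pass is the design choice that makes the backward pass a single sweep of multiplications.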
Differentiate $f(x) = \sin(\cos(x^2))$.
Apply the chain rule three times: $\sin'$ evaluated at $\cos(x^2)$, then $\cos'$ evaluated at $x^2$, then $(x^2)' = 2x$. Result: $f'(x) = -2x \sin(x^2) \cos(\cos(x^2))$.
Derivative of the Inverse Function
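If $y = f^{-1}(x)$, then $x = f(y)$. Differentiating both sides with respect to $x$ and applying the chain rule gives $1 = f'(y) \cdot y'$, so wherever $f'(y) \neq 0$:
$$(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))}$$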
Derivation of $(\arcsin x)'$
Via the inverse function
If $y = \arcsin x$, then $x = \sin y$
Differentiate $x = \sin y$ with respect to $x$:
$$1 = \cos y \cdot \frac{dy}{dx}$$
$$\frac{dy}{dx} = \frac{1}{\cos y}$$
From $\sin^2 y + \cos^2 y = 1$ and $\sin y = x$:
$$\cos y = \sqrt{1 - x^2}$$
(taking the positive root for $y \in [-\pi/2, \pi/2]$)
$$(\arcsin x)' = \boxed{\frac{1}{\sqrt{1 - x^2}}}$$
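The same reasoning works when the inverse has no convenient closed form. The sketch below uses a hypothetical example, $f(y) = y^3 + y$, which is strictly increasing and therefore invertible. It inverts $f$ numerically with Newton's method and evaluates the inverse-function derivative as $1 / f'(f^{-1}(x))$. All function names are illustrative.

```python
def f(y):
    return y ** 3 + y


def f_prime(y):
    return 3 * y ** 2 + 1  # always >= 1, so the formula never divides by zero


def f_inverse(x, tol=1e-14, max_iter=100):
    """Solve f(y) = x for y with Newton's method."""
    y = x  # simple initial guess
    for _ in range(max_iter):
        step = (f(y) - x) / f_prime(y)
        y -= step
        if abs(step) < tol:
            break
    return y


def inverse_derivative(x):
    """Derivative of f_inverse at x, via the inverse-function formula."""
    return 1.0 / f_prime(f_inverse(x))


x = 2.0  # f(1) = 2, so f_inverse(2) = 1 and the derivative is 1 / f'(1) = 1/4
h = 1e-5
numeric = (f_inverse(x + h) - f_inverse(x - h)) / (2 * h)
print(inverse_derivative(x), numeric)
```

Note that the derivative of the inverse is obtained without ever writing a formula for $f^{-1}$ itself.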
Table of Inverse Trigonometric Derivatives
| $f(x)$ | $f'(x)$ | Domain |
|---|---|---|
| $\arcsin x$ | $\frac{1}{\sqrt{1-x^2}}$ | $|x| < 1$ |
| $\arccos x$ | $-\frac{1}{\sqrt{1-x^2}}$ | $|x| < 1$ |
| $\arctan x$ | $\frac{1}{1+x^2}$ | $x \in \mathbb{R}$ |
| $\text{arccot}\, x$ | $-\frac{1}{1+x^2}$ | $x \in \mathbb{R}$ |
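A quick numerical check of the table, as a sketch. Python's `math` module has no arccotangent, so the code assumes the convention $\text{arccot}\, x = \frac{\pi}{2} - \arctan x$; the sample point `x = 0.3` is arbitrary but must satisfy $|x| < 1$ for the first two rows.

```python
import math


def central_difference(func, x, h=1e-6):
    """Numerical estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


def arccot(x):
    # Assumed convention: arccot x = pi/2 - arctan x (values in (0, pi)).
    return math.pi / 2 - math.atan(x)


# name -> (function, derivative formula from the table)
table = {
    "arcsin": (math.asin, lambda x: 1 / math.sqrt(1 - x ** 2)),
    "arccos": (math.acos, lambda x: -1 / math.sqrt(1 - x ** 2)),
    "arctan": (math.atan, lambda x: 1 / (1 + x ** 2)),
    "arccot": (arccot, lambda x: -1 / (1 + x ** 2)),
}

x = 0.3
max_error = 0.0
for name, (func, formula) in table.items():
    error = abs(formula(x) - central_difference(func, x))
    max_error = max(max_error, error)
    print(f"{name}: formula = {formula(x):.6f}, error = {error:.2e}")
```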
Why is $(\arccos x)' = -\frac{1}{\sqrt{1-x^2}}$ and not $+$?
$\arccos x$ is a decreasing function, and the derivative of a decreasing function is never positive. Also: $\arccos x = \frac{\pi}{2} - \arcsin x$, so its derivative is the negative of $(\arcsin x)'$.
Logarithmic Differentiation
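For functions of the form $y = f(x)^{g(x)}$ with $f(x) > 0$ - "variable raised to a variable power" - we first **take the logarithm** and then differentiate both sides:
$$\ln y = g(x) \ln f(x)$$
$$\frac{y'}{y} = g'(x) \ln f(x) + g(x) \cdot \frac{f'(x)}{f(x)}$$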
$x^x$ - a Classic
Variable raised to a variable power
$y = x^x$ (for $x > 0$)
**Step 1**: Take the logarithm of both sides
$$\ln y = \ln(x^x) = x \ln x$$
**Step 2**: Differentiate both sides
$$\frac{y'}{y} = (x \ln x)' = 1 \cdot \ln x + x \cdot \frac{1}{x} = \ln x + 1$$
**Step 3**: Multiply by $y$
$$y' = y(\ln x + 1) = \boxed{x^x(\ln x + 1)}$$
Complex Product
Logarithm simplifies
$y = \frac{x^2 \sqrt{x+1}}{(x-1)^3}$ (for $x > 1$)
Take the logarithm:
$$\ln y = 2\ln x + \frac{1}{2}\ln(x+1) - 3\ln(x-1)$$
Differentiate:
$$\frac{y'}{y} = \frac{2}{x} + \frac{1}{2(x+1)} - \frac{3}{x-1}$$
$$y' = y \left(\frac{2}{x} + \frac{1}{2(x+1)} - \frac{3}{x-1}\right)$$
**Logarithmic differentiation** turns products into sums, quotients into differences, and powers into multipliers - everything simplifies!
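The result for the complex product can be checked numerically. This sketch evaluates $y' = y \cdot (\ln y)'$ using the sum obtained above and compares it with a central-difference estimate; the function names and the sample point `x = 2.5` (which satisfies $x > 1$) are illustrative.

```python
import math


def y(x):
    return x ** 2 * math.sqrt(x + 1) / (x - 1) ** 3


def log_derivative(x):
    """(ln y)' = 2/x + 1/(2(x+1)) - 3/(x-1), one term per factor of y."""
    return 2 / x + 1 / (2 * (x + 1)) - 3 / (x - 1)


def y_prime(x):
    """y' recovered by multiplying the logarithmic derivative by y."""
    return y(x) * log_derivative(x)


x = 2.5
h = 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)
print(y_prime(x), numeric)
```

Each factor of the original expression contributes exactly one term to `log_derivative`, which is why this route avoids nested product and quotient rules.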
Incorrect: $(x^x)' = x \cdot x^{x-1} = x^x$ - treated as an ordinary power
Correct: $(x^x)' = x^x(\ln x + 1)$ - logarithmic differentiation is required
The formula $(x^n)' = nx^{n-1}$ works only for a **constant** exponent $n$. When the exponent depends on $x$, it's a different story!
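The difference between the two answers is easy to see numerically. The sketch below compares the misapplied power rule and the logarithmic-differentiation result with a central-difference estimate at the arbitrary point `x = 2.0`; the function names are illustrative.

```python
import math


def x_to_the_x(x):
    return x ** x


def naive_power_rule(x):
    # Misapplied rule (x^n)' = n x^(n-1) with n = x; simplifies to x^x.
    return x * x ** (x - 1)


def log_diff_rule(x):
    # Result of logarithmic differentiation: x^x (ln x + 1).
    return x ** x * (math.log(x) + 1)


x = 2.0
h = 1e-6
numeric = (x_to_the_x(x + h) - x_to_the_x(x - h)) / (2 * h)

print("numeric estimate :", numeric)
print("naive power rule :", naive_power_rule(x))
print("log. diff. rule  :", log_diff_rule(x))
```

Only the logarithmic-differentiation value matches the numerical estimate; the naive value is missing the $x^x \ln x$ contribution that comes from the varying exponent.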
Using logarithmic differentiation, what is $\frac{d}{dx}[x^x]$?
Let $y = x^x$. Then $\ln y = x \ln x$, differentiate: $y'/y = \ln x + 1$, so $y' = x^x(\ln x + 1)$.
Connections to Other Topics
The chain rule is a bridge to advanced methods
- Backpropagation in ML - The gradient "flows" through the chain of network layers
- Implicit Differentiation - Uses the chain rule for related variables
- Integration by Substitution - Reverses the chain rule
- Multivariable Analysis - The chain rule generalizes to functions of several variables
- Differential Equations - Separation of variables relies on the chain rule
Summary
- **Chain rule**: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$ - outer × inner
- In Leibniz notation: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ - "$du$ cancels"
- For many layers: multiply the derivatives of all layers
- **Inverse functions**: $(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))}$
- **Logarithmic diff.**: for $f^g$ first take the log, then differentiate
Questions for Reflection
- Why is the most common mistake forgetting the inner derivative?
- How is the chain rule related to backpropagation in neural networks?
- Why is $(\arcsin x)' = \frac{1}{\sqrt{1-x^2}}$ and not simply $\frac{1}{\cos x}$?
- When is logarithmic differentiation more convenient than the standard approach?