Calculus

Derivative of a Composite Function

Lesson Objectives

Understand and apply the chain rule
Differentiate multi-layer compositions
Compute derivatives of inverse functions
Master logarithmic differentiation
Know the derivatives of inverse trigonometric functions

Prerequisites

Differentiation rules
The concept of function composition
Properties of logarithms

Differentiation Rules

In 1974, Paul Werbos applied the chain rule to multilayer network training in his Harvard PhD thesis. The idea drew little attention until 1986, when Rumelhart, Hinton and Williams published "Learning representations by back-propagating errors" in Nature. Since then, models from GPT to AlphaFold have been trained by repeated application of the chain rule across billions of parameters. PyTorch autograd (2017) and JAX (2018) automate exactly this operation.

**PyTorch autograd**: torch.Tensor.backward() is the chain rule in applied form. Each operation records its local gradient, and the local gradients are then multiplied along the chain from the loss back to the parameters.
**GPT-3 (175B parameters)**: one training step is the chain rule applied to a graph with around $10^{12}$ nodes. Without backpropagation, training at this scale would not be practical.
**JAX grad/vjp/jvp**: forward-mode and reverse-mode autodiff are both forms of the chain rule. vjp (vector-Jacobian product) is the reverse-mode primitive behind backprop; jvp (Jacobian-vector product) is its forward-mode counterpart.
**Vanishing gradients**: a long chain of factors with magnitude below one shrinks towards zero. This is why LSTMs (1997), residual connections in ResNet (2015), and activation normalisation exist.
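The vanishing-gradient effect can be illustrated with a few lines of plain Python. This is a minimal sketch under strong simplifying assumptions that are not part of the lesson: every layer is a single sigmoid unit, all weights are 1, and every pre-activation is 0, so each local derivative is 0.25. The helper name `chained_gradient` is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z: float) -> float:
    # Derivative of the sigmoid: s(z) * (1 - s(z)); its maximum is 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(depth: int, z: float = 0.0) -> float:
    """Product of `depth` identical local derivatives (assumed toy setting)."""
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_prime(z)  # chain rule: multiply by each local derivative
    return grad


for depth in (1, 5, 10, 50):
    print(depth, chained_gradient(depth))
```

Even in this best case for the sigmoid, fifty layers multiply the gradient by $0.25^{50}$, which is far below anything useful for a parameter update.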

The Foundation of Modern ML

The chain rule has been known since the 17th century, but its significance grew in the era of machine learning. In 1974 **Paul Werbos** described the application of the chain rule to training neural networks - this became the basis of **backpropagation**. In 1986 Rumelhart, Hinton and Williams popularized the method, and today repeated application of the chain rule lies at the core of every neural network training run.

The Chain Rule

For the **composition** $f(g(x))$ - a "function of a function" - the derivative is computed as follows:

$$(f(g(x)))' = f'(g(x)) \cdot g'(x)$$

**Algorithm**:

Identify the **outer** function $f$ and the **inner** function $g$
Take the derivative of the outer function, leaving the inner function unchanged
Multiply by the derivative of the inner function

**Mnemonic**: "derivative of outside × derivative of inside".
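The three steps of the algorithm can be sketched in Python. The helper names `chain_rule` and `central_difference` are illustrative, not from any library, and the finite-difference estimate is used only as a sanity check of the hand-derived result.

```python
import math
from typing import Callable

Fn = Callable[[float], float]


def chain_rule(f_prime: Fn, g: Fn, g_prime: Fn, x: float) -> float:
    """Return (f(g(x)))' = f'(g(x)) * g'(x)."""
    # Outer derivative evaluated at the unchanged inner function,
    # multiplied by the inner derivative.
    return f_prime(g(x)) * g_prime(x)


def central_difference(h: Fn, x: float, eps: float = 1e-6) -> float:
    """Numerical estimate of h'(x), used only for checking."""
    return (h(x + eps) - h(x - eps)) / (2.0 * eps)


# Example: sin(x^3) with outer f = sin and inner g = x^3.
x0 = 1.2
analytic = chain_rule(math.cos, lambda x: x**3, lambda x: 3 * x**2, x0)
numeric = central_difference(lambda x: math.sin(x**3), x0)
print(analytic, numeric)
```

The two printed values should agree to several decimal places; a large gap usually means one of the two derivatives passed in was wrong.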

For the function $\sin(x^3)$: which is the outer function and which is the inner?

First we compute $x^3$ (inner), then we take $\sin$ of the result (outer). Sine is the outermost layer.

In Leibniz Notation

Let $y = f(u)$ and $u = g(x)$. Then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

It looks as though $du$ "cancels out"! This is not a genuine cancellation, but the mnemonic works.

Leibniz's notation shows the **path** of the derivative: from $y$ to $u$, from $u$ to $x$. Hence the name - the **chain** rule.

How is the chain rule for $y = y(u)$, $u = u(x)$ written in Leibniz notation?

The chain of dependencies $x \to u \to y$ multiplies rates: $dy/dx = (dy/du)(du/dx)$.

Examples

Power of an Expression

$(x^2 + 1)^3$

$f(x) = (x^2 + 1)^3$
Outer: $u^3$, its derivative $3u^2$
Inner: $u = x^2 + 1$, its derivative $2x$
$$f'(x) = 3(x^2 + 1)^2 \cdot 2x = \boxed{6x(x^2 + 1)^2}$$

Trigonometry of an Expression

$\sin(x^2)$

$f(x) = \sin(x^2)$
Outer: $\sin u$, its derivative $\cos u$
Inner: $u = x^2$, its derivative $2x$
$$f'(x) = \cos(x^2) \cdot 2x = \boxed{2x\cos(x^2)}$$

Incorrect: $(\sin(x^2))' = \cos(x^2)$ - the inner derivative was forgotten

Correct: $(\sin(x^2))' = 2x\cos(x^2)$ - the result must be multiplied by the derivative of the argument

The derivative of $\sin$ gives $\cos$, but the argument is not simply $x$, it is $x^2$. We still need to multiply by $(x^2)' = 2x$!

Exponential of a Linear Expression

$e^{3x}$

$f(x) = e^{3x}$
Outer: $e^u$, its derivative $e^u$
Inner: $u = 3x$, its derivative $3$
$$f'(x) = e^{3x} \cdot 3 = \boxed{3e^{3x}}$$
**Generalization**: $(e^{ax})' = ae^{ax}$

What is $(\cos(5x))'$?

The outer $\cos u$ gives $-\sin u$. The inner $5x$ gives $5$. Result: $-\sin(5x) \cdot 5 = -5\sin(5x)$.
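The worked examples above can be checked numerically. The sketch below compares each hand-derived derivative with a central-difference estimate at one arbitrarily chosen test point. The helper names are illustrative, and the last lines reproduce the common mistake of dropping the inner derivative.

```python
import math


def central_difference(h, x, eps=1e-6):
    """Numerical estimate of h'(x), used only for checking."""
    return (h(x + eps) - h(x - eps)) / (2.0 * eps)


# name -> (function, hand-derived derivative)
examples = {
    "(x^2+1)^3": (lambda x: (x**2 + 1) ** 3, lambda x: 6 * x * (x**2 + 1) ** 2),
    "sin(x^2)": (lambda x: math.sin(x**2), lambda x: 2 * x * math.cos(x**2)),
    "exp(3x)": (lambda x: math.exp(3 * x), lambda x: 3 * math.exp(3 * x)),
    "cos(5x)": (lambda x: math.cos(5 * x), lambda x: -5 * math.sin(5 * x)),
}


def max_error(x: float) -> float:
    """Largest gap between the numerical and hand-derived derivatives at x."""
    return max(abs(central_difference(f, x) - df(x)) for f, df in examples.values())


x0 = 0.7  # arbitrary test point
print("largest discrepancy:", max_error(x0))

# The common mistake: forgetting the inner derivative of sin(x^2).
wrong = math.cos(x0**2)
mistake_gap = abs(central_difference(lambda x: math.sin(x**2), x0) - wrong)
print("gap when the inner derivative is dropped:", mistake_gap)
```

The first number should be tiny (finite-difference noise), while the second is clearly not, which is a quick way to catch a missing factor.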

Multi-Layer Compositions

For a triple composition $f(g(h(x)))$ the rule is applied **in sequence**:

$$(f(g(h(x))))' = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

Triple Composition

$\sin(\cos(x^2))$

$f(x) = \sin(\cos(x^2))$
Three layers: $\sin$, $\cos$, $x^2$
$$f'(x) = \cos(\cos(x^2)) \cdot (-\sin(x^2)) \cdot 2x$$
$$= \boxed{-2x \sin(x^2) \cos(\cos(x^2))}$$
**Order**: work from the outermost to the innermost, multiplying the derivatives.

In neural network **backpropagation** exactly this happens: the gradient "flows" through the chain, being multiplied by the local derivative of each layer.
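The flow of the gradient through the layers can be sketched for $\sin(\cos(x^2))$ in plain Python. This is a toy illustration, not how PyTorch or JAX are implemented: each "layer" is a pair of a scalar function and its derivative, and `forward_backward` is a hypothetical helper name.

```python
import math

# Layers listed innermost first: (function, derivative of that function).
layers = [
    (lambda x: x**2, lambda x: 2 * x),
    (math.cos, lambda u: -math.sin(u)),
    (math.sin, math.cos),
]


def forward_backward(layers, x):
    """Return (value, derivative) of the composed layers at x."""
    # Forward pass: remember the input that each layer received.
    inputs = []
    value = x
    for fn, _ in layers:
        inputs.append(value)
        value = fn(value)

    # Backward pass: start from 1 and multiply by each local derivative,
    # walking from the outermost layer back to the innermost one.
    grad = 1.0
    for (_, dfn), layer_input in zip(reversed(layers), reversed(inputs)):
        grad *= dfn(layer_input)
    return value, grad


x0 = 0.8
value, grad = forward_backward(layers, x0)
closed_form = -2 * x0 * math.sin(x0**2) * math.cos(math.cos(x0**2))
print(grad, closed_form)
```

Storing the layer inputs during the forward pass is the design choice that makes the backward pass cheap: every local derivative is evaluated at a value that has already been computed.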

Differentiate $f(x) = \sin(\cos(x^2))$.

The chain rule is applied three times: the outer $\sin'$ evaluated at $\cos(x^2)$, then $\cos'$ evaluated at $x^2$, then $(x^2)' = 2x$. The result is $f'(x) = -2x \sin(x^2) \cos(\cos(x^2))$.

Derivative of the Inverse Function

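If $y = f^{-1}(x)$, then $x = f(y)$. Differentiating both sides with respect to $x$ and applying the chain rule:

$$1 = f'(y) \cdot \frac{dy}{dx} \quad\Longrightarrow\quad (f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))}$$

The formula holds wherever $f'(f^{-1}(x)) \neq 0$.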

Derivation of $(\arcsin x)'$

Via the inverse function

If $y = \arcsin x$, then $x = \sin y$.
Differentiate $x = \sin y$ with respect to $x$:
$$1 = \cos y \cdot \frac{dy}{dx}$$
$$\frac{dy}{dx} = \frac{1}{\cos y}$$
From $\sin^2 y + \cos^2 y = 1$ and $\sin y = x$:
$$\cos y = \sqrt{1 - x^2}$$
(taking the positive root, since $\cos y \ge 0$ for $y \in [-\pi/2, \pi/2]$)
$$(\arcsin x)' = \boxed{\frac{1}{\sqrt{1 - x^2}}}$$
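The inverse-function rule $(f^{-1})'(x) = 1 / f'(f^{-1}(x))$ can be tried out directly. The sketch below uses a hypothetical helper `inverse_derivative` and checks it against the closed form for $\arcsin$; the second call, with $\ln$ as the inverse of $e^x$, is an extra illustration not taken from the lesson.

```python
import math


def inverse_derivative(f_prime, f_inverse, x):
    """(f^{-1})'(x) = 1 / f'(f^{-1}(x)); assumes f'(f^{-1}(x)) != 0."""
    return 1.0 / f_prime(f_inverse(x))


x0 = 0.3

# f = sin, so f' = cos and f^{-1} = arcsin.
via_formula = inverse_derivative(math.cos, math.asin, x0)
closed_form = 1.0 / math.sqrt(1.0 - x0**2)
print(via_formula, closed_form)

# f = exp, so f' = exp and f^{-1} = ln; the result should be 1/x.
log_derivative = inverse_derivative(math.exp, math.log, 2.0)
print(log_derivative)
```

Note that the helper needs only the derivative of the original function, never a formula for the derivative of the inverse itself.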

Table of Inverse Trigonometric Derivatives

$f(x)$	$f'(x)$	Domain of $f'$
$\arcsin x$	$\frac{1}{\sqrt{1-x^2}}$	$|x| < 1$
$\arccos x$	$-\frac{1}{\sqrt{1-x^2}}$	$|x| < 1$
$\arctan x$	$\frac{1}{1+x^2}$	$x \in \mathbb{R}$
$\text{arccot}\, x$	$-\frac{1}{1+x^2}$	$x \in \mathbb{R}$

Why is $(\arccos x)' = -\frac{1}{\sqrt{1-x^2}}$ and not $+$?

$\arccos x$ is a decreasing function, so its derivative cannot be positive. Also, $\arccos x = \frac{\pi}{2} - \arcsin x$, so the two derivatives are equal in magnitude and opposite in sign.
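The four table entries, and the identity $\arccos x = \frac{\pi}{2} - \arcsin x$, can be sanity-checked numerically. Python's `math` module has no `arccot`, so the sketch assumes the convention $\text{arccot}\, x = \frac{\pi}{2} - \arctan x$; the helper names are illustrative.

```python
import math


def central_difference(h, x, eps=1e-6):
    """Numerical estimate of h'(x), used only for checking."""
    return (h(x + eps) - h(x - eps)) / (2.0 * eps)


def arccot(x):
    # Assumed convention: arccot x = pi/2 - arctan x, with values in (0, pi).
    return math.pi / 2 - math.atan(x)


# (name, function, derivative from the table)
table = [
    ("arcsin", math.asin, lambda x: 1 / math.sqrt(1 - x**2)),
    ("arccos", math.acos, lambda x: -1 / math.sqrt(1 - x**2)),
    ("arctan", math.atan, lambda x: 1 / (1 + x**2)),
    ("arccot", arccot, lambda x: -1 / (1 + x**2)),
]

x0 = 0.4  # must satisfy |x0| < 1 for arcsin and arccos
errors = {name: abs(central_difference(f, x0) - df(x0)) for name, f, df in table}
identity_gap = abs(math.acos(x0) - (math.pi / 2 - math.asin(x0)))

for name, err in errors.items():
    print(name, err)
print("arccos identity gap:", identity_gap)
```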

Logarithmic Differentiation

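For functions of the form $y = f(x)^{g(x)}$ with $f(x) > 0$ - "variable raised to a variable power" - we first **take the logarithm** and then differentiate both sides:

$$\ln y = g(x) \ln f(x) \quad\Longrightarrow\quad \frac{y'}{y} = g'(x) \ln f(x) + g(x) \frac{f'(x)}{f(x)}$$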

$x^x$ - a Classic

Variable raised to a variable power

$y = x^x$ (for $x > 0$)
**Step 1**: Take the logarithm of both sides
$$\ln y = \ln(x^x) = x \ln x$$
**Step 2**: Differentiate both sides
$$\frac{y'}{y} = (x \ln x)' = 1 \cdot \ln x + x \cdot \frac{1}{x} = \ln x + 1$$
**Step 3**: Multiply by $y$
$$y' = y(\ln x + 1) = \boxed{x^x(\ln x + 1)}$$
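The same three steps work for any $y = f(x)^{g(x)}$ with $f(x) > 0$: taking logs gives $\ln y = g \ln f$, differentiating gives $y'/y = g' \ln f + g f'/f$, and multiplying by $y$ finishes the job. The sketch below wraps this in a hypothetical helper `log_diff_power` and checks it on $x^x$ against the boxed result.

```python
import math


def log_diff_power(f, df, g, dg, x):
    """Derivative of f(x)**g(x) via logarithmic differentiation; assumes f(x) > 0."""
    y = f(x) ** g(x)
    # y'/y = g'(x) * ln f(x) + g(x) * f'(x) / f(x), then multiply by y.
    return y * (dg(x) * math.log(f(x)) + g(x) * df(x) / f(x))


x0 = 2.0

# x^x: base f(x) = x and exponent g(x) = x, both with derivative 1.
general = log_diff_power(lambda x: x, lambda x: 1.0, lambda x: x, lambda x: 1.0, x0)
boxed = x0**x0 * (math.log(x0) + 1)
print(general, boxed)
```

With a constant exponent the $g'$ term vanishes and the helper reduces to the ordinary power rule, which is a useful consistency check.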

Complex Product

Logarithm simplifies

$y = \frac{x^2 \sqrt{x+1}}{(x-1)^3}$ (for $x > 1$)
Take the logarithm:
$$\ln y = 2\ln x + \frac{1}{2}\ln(x+1) - 3\ln(x-1)$$
Differentiate:
$$\frac{y'}{y} = \frac{2}{x} + \frac{1}{2(x+1)} - \frac{3}{x-1}$$
$$y' = y \left(\frac{2}{x} + \frac{1}{2(x+1)} - \frac{3}{x-1}\right)$$

**Logarithmic differentiation** turns products into sums, quotients into differences, and powers into multipliers - everything simplifies!
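The result for $y = \frac{x^2 \sqrt{x+1}}{(x-1)^3}$ can be checked at a concrete point. The sketch below evaluates the formula obtained by logarithmic differentiation and compares it with a central-difference estimate; the test point $x = 3$ is an arbitrary choice inside the domain $x > 1$.

```python
import math


def central_difference(h, x, eps=1e-6):
    """Numerical estimate of h'(x), used only for checking."""
    return (h(x + eps) - h(x - eps)) / (2.0 * eps)


def y(x):
    return x**2 * math.sqrt(x + 1) / (x - 1) ** 3


def y_prime(x):
    # y' = y * (2/x + 1/(2(x+1)) - 3/(x-1)), valid for x > 1
    return y(x) * (2 / x + 1 / (2 * (x + 1)) - 3 / (x - 1))


x0 = 3.0
analytic = y_prime(x0)
numeric = central_difference(y, x0)
print(analytic, numeric)
```

At $x = 3$ the function value is $9 \cdot 2 / 8 = 2.25$ and the bracket is $\frac{2}{3} + \frac{1}{8} - \frac{3}{2} = -\frac{17}{24}$, so the derivative is $-\frac{153}{96} = -1.59375$.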

Incorrect: $(x^x)' = x \cdot x^{x-1} = x^x$ - treated as an ordinary power

Correct: $(x^x)' = x^x(\ln x + 1)$ - logarithmic differentiation is required

The formula $(x^n)' = nx^{n-1}$ works only for a **constant** exponent $n$. When the exponent depends on $x$, it's a different story!
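A quick numerical comparison shows how far off the misapplied power rule is. The sketch evaluates both candidate formulas for $(x^x)'$ at $x = 2$, an arbitrary test point, and compares them with a central-difference estimate.

```python
import math


def central_difference(h, x, eps=1e-6):
    """Numerical estimate of h'(x), used only for checking."""
    return (h(x + eps) - h(x - eps)) / (2.0 * eps)


x0 = 2.0
numeric = central_difference(lambda x: x**x, x0)

naive = x0 * x0 ** (x0 - 1)                # power rule misapplied: equals x^x
correct = x0**x0 * (math.log(x0) + 1)      # from logarithmic differentiation

print("numeric:", numeric)
print("naive:  ", naive)
print("correct:", correct)
```

The naive value misses the entire $x^x \ln x$ term, which comes from the exponent's dependence on $x$.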

Using logarithmic differentiation, what is $\frac{d}{dx}[x^x]$?

Let $y = x^x$. Then $\ln y = x \ln x$, differentiate: $y'/y = \ln x + 1$, so $y' = x^x(\ln x + 1)$.

Connections to Other Topics

The chain rule is a bridge to advanced methods

Backpropagation in ML — The gradient "flows" through the chain of network layers
Implicit Differentiation — Uses the chain rule for related variables
Integration by Substitution — The inverse operation of the chain rule
Multivariable Analysis — The chain rule generalizes to functions of several variables
Differential Equations — Method of separation of variables

Summary

**Chain rule**: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$ - outer × inner
In Leibniz notation: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ - "$du$ cancels"
For many layers: multiply the derivatives of all layers
**Inverse functions**: $(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))}$
**Logarithmic differentiation**: for $f^g$ first take the log, then differentiate

Questions for Reflection

Why is the most common mistake forgetting the inner derivative?
How is the chain rule related to backpropagation in neural networks?
Why is $(\arcsin x)' = \frac{1}{\sqrt{1-x^2}}$ and not simply $\frac{1}{\cos x}$?
When is logarithmic differentiation more convenient than the standard approach?

Related Lessons

ml-26-backpropagation