Differential Equations
Stochastic Differential Equations (SDEs)
How does one price a stock option? How does one model molecular collisions? How do diffusion models in DALL-E and Stable Diffusion actually work? All of these problems share one mathematical object: a stochastic differential equation, in which deterministic drift coexists with random noise.
- **Finance**: the Black-Scholes model for option pricing is built on geometric Brownian motion (GBM), an SDE with multiplicative noise; Nobel Prize in Economics, 1997
- **Physics**: the Langevin equation dv = -gamma*v dt + sigma dB describes a Brownian particle in a fluid; it is the stochastic analogue of Newton's second law
- **Generative AI**: diffusion models (DDPM, score-based) use SDEs to gradually dissolve data into noise and then reverse the process to generate new samples
Prerequisites
Brownian motion: continuous and nowhere differentiable paths
Brownian motion (the Wiener process) B(t) is the mathematical model for a random walk in continuous time. It is the foundation of all SDE theory. Norbert Wiener gave a rigorous definition in 1923; the botanist Robert Brown had observed pollen grains jittering in water back in 1827.
A derivative B'(t) = lim [B(t+h) - B(t)] / h would require the increment B(t+h) - B(t) ~ N(0, h) to be of order h. But its standard deviation is sqrt(h), which is much larger than h for small h, so the difference quotient has standard deviation 1/sqrt(h) and blows up as h -> 0. Intuition: the path is too rough. Its graph is a fractal of Hausdorff dimension 3/2, and the path is Holder continuous only for exponents strictly less than 1/2.
What is the quadratic variation of Brownian motion B(t) on [0, T]?
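The roughness described above can be checked numerically. The sketch below assumes a uniform grid and NumPy's default generator; the helper names (`brownian_path`, `quadratic_variation`, `total_variation`) are illustrative, not from any library. On a fine grid the sum of squared increments should settle near T, while the sum of absolute increments keeps growing as the grid is refined.

```python
import numpy as np

def brownian_path(T, n_steps, rng):
    """Sample B on a uniform grid over [0, T]; returns n_steps + 1 values, B(0) = 0."""
    dt = T / n_steps
    increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)  # each ~ N(0, dt)
    return np.concatenate(([0.0], np.cumsum(increments)))

def quadratic_variation(path):
    """Sum of squared increments along the sampled path."""
    return float(np.sum(np.diff(path) ** 2))

def total_variation(path):
    """Sum of absolute increments along the sampled path."""
    return float(np.sum(np.abs(np.diff(path))))

rng = np.random.default_rng(0)
T = 2.0  # illustrative horizon
for n_steps in (1_000, 100_000):
    B = brownian_path(T, n_steps, rng)
    print(n_steps, quadratic_variation(B), total_variation(B))
```

The quadratic variation stays close to T = 2 at both resolutions, whereas the total variation grows roughly like sqrt(n_steps), which is why a pathwise Riemann-Stieltjes integral against B is not available.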
The Ito integral and Ito's lemma
The ordinary Riemann-Stieltjes integral does not work with respect to Brownian motion: B(t) is too rough, with unbounded variation on every interval. Kiyosi Ito constructed the stochastic integral of f dB in 1944, consistent with the quadratic variation. The central consequence is **Ito's lemma**, the stochastic chain rule.
In ordinary calculus: if X shifts by dX, then f(X) changes by approximately f'(X) dX (linear approximation). In stochastic calculus: (dB)^2 is of order dt (not dt^2!), so the second-order Taylor term must be kept. That is the Ito correction. Analogy: imagine that the floor you walk on vibrates so violently that the square of a small step is comparable to the time the step takes, rather than negligible; classical calculus no longer applies.
Why does Ito's lemma contain the extra term (1/2)*sigma^2*d^2g/dX^2, absent from the ordinary chain rule?
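A small numerical experiment makes the extra term visible. For g(x) = x^2, Ito's lemma gives d(B^2) = 2 B dB + dt, so the Ito integral of B dB over [0, T] should equal (B(T)^2 - T) / 2 rather than the B(T)^2 / 2 that the ordinary chain rule would suggest. The sketch below assumes a uniform grid and left-endpoint sums; the function name `ito_vs_classical` is hypothetical.

```python
import numpy as np

def ito_vs_classical(T, n_steps, rng):
    """Compare a left-endpoint (Ito) sum of B dB with two closed-form candidates."""
    dt = T / n_steps
    dB = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    B = np.concatenate(([0.0], np.cumsum(dB)))

    # Ito sum: the integrand B is evaluated at the LEFT endpoint of each interval.
    ito_integral = float(np.sum(B[:-1] * dB))

    ito_prediction = (B[-1] ** 2 - T) / 2   # from Ito's lemma with g(x) = x^2
    classical_guess = B[-1] ** 2 / 2        # what the ordinary chain rule would give
    return ito_integral, ito_prediction, classical_guess

rng = np.random.default_rng(0)
ito_integral, ito_prediction, classical_guess = ito_vs_classical(1.0, 200_000, rng)
print("Ito sum        ", ito_integral)
print("Ito prediction ", ito_prediction)
print("classical guess", classical_guess)
```

The Ito sum tracks the Ito prediction, and the gap to the classical guess is about T / 2, which is exactly the accumulated (1/2) (dB)^2 = (1/2) dt term.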
SDEs: geometric Brownian motion and stock price models
A **stochastic differential equation** (SDE) reads: dX = mu(X, t) dt + sigma(X, t) dB, where mu is the drift coefficient and sigma is the diffusion coefficient. It generalizes an ODE by adding random noise.
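One common way to simulate such an equation is the Euler-Maruyama scheme, which replaces dt by a small step and dB by an N(0, dt) draw. The sketch below is a minimal version under that assumption, not a production solver; the function name `euler_maruyama` and all parameter values are illustrative. It is applied to the Langevin equation dv = -gamma*v dt + s dB, writing s for the constant noise level.

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n_steps, n_paths, rng):
    """Simulate dX = mu(X, t) dt + sigma(X, t) dB with the Euler-Maruyama scheme.

    Returns an array of shape (n_steps + 1, n_paths); row k holds X at time k * dt.
    """
    dt = T / n_steps
    X = np.empty((n_steps + 1, n_paths))
    X[0] = x0
    for k in range(n_steps):
        t = k * dt
        dB = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increments
        X[k + 1] = X[k] + mu(X[k], t) * dt + sigma(X[k], t) * dB
    return X

# Langevin example with illustrative parameters gamma = 2, s = 1.
gamma, s = 2.0, 1.0
rng = np.random.default_rng(0)
V = euler_maruyama(
    mu=lambda v, t: -gamma * v,
    sigma=lambda v, t: s,
    x0=1.0, T=5.0, n_steps=500, n_paths=5_000, rng=rng,
)
# Theory for this process: the mean decays to 0 and the variance tends to s^2 / (2 * gamma).
print(V[-1].mean(), V[-1].var())
```

Passing drift and diffusion as callables keeps the stepping loop identical for any one-dimensional SDE of this form; only the two lambdas change.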
A GBM paradox: the mean is E[S(T)] = S0*e^(mu*T), but the median (the typical trajectory) is S0*e^((mu - sigma^2/2)*T). The correction -sigma^2/2 is the Ito correction. Variance 'eats' part of the drift. The more volatile the asset (larger sigma), the more the 'average' diverges from the 'typical' trajectory. This has a practical consequence: of two portfolios with equal mean returns, the more volatile one has the lower median, long-run compounded growth. This volatility drag is the effect that the Kelly criterion accounts for.
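The gap between mean and median can be seen by sampling S(T) directly from the closed-form GBM solution S0*exp((mu - sigma^2/2)*T + sigma*B(T)). The parameter values below are illustrative assumptions, chosen so that mu - sigma^2/2 = 0, and `sample_gbm_terminal` is a hypothetical helper name.

```python
import numpy as np

def sample_gbm_terminal(S0, mu, sigma, T, n_paths, rng):
    """Draw S(T) from the exact GBM solution, using B(T) ~ N(0, T)."""
    B_T = rng.normal(0.0, np.sqrt(T), size=n_paths)
    return S0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * B_T)

# Illustrative values: here mu - sigma^2 / 2 = 0, so the median should not grow at all.
S0, mu, sigma, T = 100.0, 0.08, 0.4, 10.0
rng = np.random.default_rng(0)
S_T = sample_gbm_terminal(S0, mu, sigma, T, n_paths=400_000, rng=rng)

print("sample mean  ", S_T.mean(), " theory", S0 * np.exp(mu * T))
print("sample median", np.median(S_T), " theory", S0 * np.exp((mu - 0.5 * sigma**2) * T))
```

With these numbers the expected value more than doubles while the median stays near S0: a few very large outcomes carry the mean, and the typical path goes nowhere.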
What is the exact solution to the GBM SDE dS = mu*S dt + sigma*S dB?
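One standard route to the answer is to apply Ito's lemma to g(S) = ln S, using (dS)^2 = sigma^2 S^2 dt. A sketch of the computation, in the notation of this section:

```latex
\begin{aligned}
d(\ln S) &= \frac{1}{S}\,dS - \frac{1}{2}\,\frac{1}{S^{2}}\,(dS)^{2} \\
         &= \bigl(\mu\,dt + \sigma\,dB\bigr) - \tfrac{1}{2}\sigma^{2}\,dt
          = \Bigl(\mu - \tfrac{1}{2}\sigma^{2}\Bigr)dt + \sigma\,dB, \\
\ln S(t) - \ln S_0 &= \Bigl(\mu - \tfrac{1}{2}\sigma^{2}\Bigr)t + \sigma B(t), \\
S(t) &= S_0 \exp\!\Bigl(\bigl(\mu - \tfrac{1}{2}\sigma^{2}\bigr)t + \sigma B(t)\Bigr).
\end{aligned}
```

The log price is therefore a Brownian motion with drift mu - sigma^2/2, which is where the Ito correction in the median comes from.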
Fokker-Planck equation: evolution of the probability density
An SDE describes individual trajectories. Often, though, what we need is not one trajectory but the **distribution** over all possible trajectories. The Fokker-Planck equation (also called the Kolmogorov forward equation) describes the evolution of the probability density p(x, t).
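Written out in one dimension for the SDE dX = mu(X, t) dt + sigma(X, t) dB, the equation takes the following form (a sketch in this section's notation):

```latex
\frac{\partial p}{\partial t}(x,t)
  = -\frac{\partial}{\partial x}\bigl[\mu(x,t)\,p(x,t)\bigr]
    + \frac{1}{2}\,\frac{\partial^{2}}{\partial x^{2}}\bigl[\sigma^{2}(x,t)\,p(x,t)\bigr]
```

The first term transports probability along the drift and the second spreads it out. With mu = 0 and constant sigma it reduces to the heat equation, \partial p / \partial t = (\sigma^{2}/2)\,\partial^{2} p / \partial x^{2}.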
The Fokker-Planck equation underlies diffusion generative models (score-based models, DDPM):
- **Forward process**: gradually add noise via dX = sigma dB, until the data is drowned in Gaussian noise
- **Reverse process** (Song et al., 2020): an SDE can be reversed in time! The reverse SDE requires knowledge of the 'score' grad_x log p(x, t). A neural network is trained to approximate this score, which enables sample generation by 'reversing' the diffusion.
This is the principle behind diffusion-based image generators such as Stable Diffusion and DALL-E 3.
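The reversal can be tried on a toy case where the score is known exactly, so no neural network is needed. The sketch below assumes one-dimensional Gaussian data N(m, s0^2) and the forward process dX = sigma dB, for which p(x, t) is N(m, s0^2 + sigma^2 t). It discretizes the reverse-time SDE with a simple Euler-type step, moving from t = T down to t = 0 via x <- x + sigma^2 * score * dt + sigma * sqrt(dt) * z. The function names and parameter values are illustrative; a real diffusion model would replace `gaussian_score` with a trained network and start from a fixed noise prior.

```python
import numpy as np

def gaussian_score(x, t, m, s0, sigma):
    """Exact score d/dx log p(x, t) when p(., t) = N(m, s0^2 + sigma^2 * t)."""
    return -(x - m) / (s0**2 + sigma**2 * t)

def reverse_sample(m, s0, sigma, T, n_steps, n_samples, rng):
    """Integrate the reverse-time SDE from t = T down to t = 0."""
    dt = T / n_steps
    # Start from the forward marginal at time T, known in closed form in this toy case.
    x = rng.normal(m, np.sqrt(s0**2 + sigma**2 * T), size=n_samples)
    for k in range(n_steps, 0, -1):
        t = k * dt
        z = rng.normal(size=n_samples)
        # Drift toward high-density regions (score term) plus fresh noise.
        x = x + sigma**2 * gaussian_score(x, t, m, s0, sigma) * dt + sigma * np.sqrt(dt) * z
    return x

rng = np.random.default_rng(0)
samples = reverse_sample(m=3.0, s0=0.5, sigma=1.0, T=4.0,
                         n_steps=2_000, n_samples=50_000, rng=rng)
print(samples.mean(), samples.std())  # should land close to m = 3.0 and s0 = 0.5
```

Starting from a wide noise distribution, the score-driven drift pulls the samples back into the narrow data distribution, which is the mechanism the reverse process relies on.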
What does the Fokker-Planck equation describe in the context of an SDE?
Key ideas
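- **Brownian motion** B(t): B(0) = 0, independent increments B(t) - B(s) ~ N(0, t-s), continuous, nowhere differentiable paths; quadratic variation on [0, T] equals T, written (dB)^2 = dt
- **Ito's lemma**: for g(t, X) with dX = mu dt + sigma dB, dg = (dg/dt + mu*dg/dX + (1/2)*sigma^2*d^2g/dX^2) dt + sigma*dg/dX dB, where the derivatives of g are partial derivatives - the correction (1/2)*sigma^2*d^2g/dX^2 is mandatory because (dB)^2 = dt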
- **GBM**: dS = mu*S dt + sigma*S dB -> S(t) = S0*exp((mu-sigma^2/2)*t + sigma*B(t)) - the basis of Black-Scholes
- **Fokker-Planck equation**: PDE for the density p(x, t): dp/dt = -d(mu*p)/dx + (1/2)*d^2(sigma^2*p)/dx^2; underlies diffusion generative models
Related topics
SDEs lie at the intersection of probability theory, physics, and ML:
- Neural ODEs and differentiable solvers: SDE + neural network = score-based models; the reverse SDE requires a neural score function
- Heat equation (parabolic type): the Fokker-Planck equation generalizes the heat equation to the case with drift
- Stochastic processes: SDEs are a special class of Markov stochastic processes with continuous trajectories
Questions for reflection
- Why does the Ito integral use left endpoints of the partition (not right endpoints or midpoints)? What does evaluating at the midpoint give instead (the Stratonovich integral)?
- The Ito correction -sigma^2/2 in GBM: why does a more volatile asset with the same mu yield a smaller 'typical' outcome? How does this relate to the Kelly criterion?
- Diffusion models train a network to approximate the score grad log p(x, t). Why is knowing the score sufficient to generate new samples?