Probability Theory
Large Deviations Theory
Цели урока
- Understand the large deviations principle as a precise exponential estimate for rare events
- Master the Cramer rate function through the Legendre-Fenchel transform
- Analyze Sanov's theorem and its connection to KL divergence
- Apply the Donsker-Varadhan theorem to Markov chains
Предварительные знания
- Moment generating functions and their properties
- Convex functions and Legendre transforms
- KL divergence and information geometry
- Central limit theorem and law of large numbers
An insurance firm wants the probability of losses five times the annual average. The CLT answers 'approximately zero' - useless. Large deviations theory gives the precise P ~ exp(-n*I(x)).
- **Insurance:** Allianz reserves are computed via the Cramer rate function for tail losses at 10^(-9)
- **Coding theory:** Shannon's theorem uses Sanov's principle - decoding error rates are determined by KL divergence
- **Reinforcement learning:** concentration of the Q-function in Q-learning is controlled by Donsker-Varadhan; PPO uses KL constraints motivated by Sanov
- **Finance:** tail VaR estimation for portfolios is inaccessible to Gaussian methods - the Cramer asymptotic is required
Cramer's Theorem and the Rate Function
Allianz keeps reserves for losses exceeding the expected value fivefold - an event with probability 10^(-9). The CLT answers 'approximately zero', which is useless. Harald Cramer solved this in 1938: P(X-bar_n >= x) ~ exp(-n*I(x)), where I is the rate function computed via the Legendre transform.
For the normal distribution I(x) = (x-mu)^2/(2*sigma^2) - the Cramer estimate matches Gaussian tails exactly. In general I is asymmetric: P(X-bar_n - E[X] > eps) and P(X-bar_n - E[X] < -eps) can decay at different rates.
The Cramer rate function I(x) is the Legendre conjugate of which function?
Sanov's Theorem for Empirical Measures
Ivan Sanov in 1957 lifted large deviations from scalar averages to distributions: what is the probability that the empirical measure mu-hat_n over a sample of size n lies closer to nu rather than the true mu? The answer is KL divergence in the exponent. This linked large deviations with Shannon information theory and underpinned the coding theorems.
Sanov's principle proves the second law of thermodynamics probabilistically: the macrostate of maximum entropy is the I-projection of the uniform distribution onto the constraint set; deviations from it are exponentially unlikely with exponent = entropy deficit.
What is the rate function in Sanov's theorem for the empirical measure?
Donsker-Varadhan Theorem for Markov Chains
Monica Donsker and Srinivasa Varadhan, in 1975-1983, built the large deviations principle for Markov chains and stochastic processes - work that earned Varadhan the Abel Prize in 2007. In modern reinforcement learning this underlies concentration estimates in Q-learning: the probability that the empirical state distribution of a chain deviates from the stationary one is controlled by the Donsker-Varadhan functional.
Large deviations connect probability and information
The large deviations principle is a bridge between probability, information theory, and statistical physics.
- Information theory — Sanov's principle: the rate function for empirical measures is KL divergence D(nu||mu); the foundation of Shannon coding theorems
- Statistical mechanics — Boltzmann's principle S = -k*log P(macrostate) is a special case of the large deviations principle
- Reinforcement learning — Concentration of the Q-function in RL is controlled by Donsker-Varadhan; PPO/TRPO use KL constraints motivated by Sanov
- Stochastic processes — Donsker-Varadhan extends large deviations to Markov chains, diffusions, and empirical occupation measures
Итоги
- **Cramer's theorem:** P(X-bar_n >= x) ~ exp(-n*I(x)), I = Legendre conjugate of the log-MGF
- **Sanov's theorem:** P(mu-hat_n ~ nu) ~ exp(-n*D_KL(nu||mu)) - rate function = KL divergence
- **Contraction principle:** Cramer follows from Sanov by projecting onto the constraint E_nu[X] = x
- **Gartner-Ellis:** extension of Cramer to dependent variables via the limiting log-MGF
- **Donsker-Varadhan:** large deviations for Markov chains; functional I(nu) accounts for the transition operator P
- **Link to CLT:** CLT gives a Gaussian approximation at 1/sqrt(n); LDP gives the precise exponential asymptotic
What distinguishes the Donsker-Varadhan functional from the KL divergence in Sanov's theorem?