Probability Theory

Large Deviations Theory

Цели урока

Understand the large deviations principle as a precise exponential estimate for rare events
Master the Cramer rate function through the Legendre-Fenchel transform
Analyze Sanov's theorem and its connection to KL divergence
Apply the Donsker-Varadhan theorem to Markov chains

Предварительные знания

Moment generating functions and their properties
Convex functions and Legendre transforms
KL divergence and information geometry
Central limit theorem and law of large numbers

Random Matrices

An insurance firm wants the probability of losses five times the annual average. The CLT answers 'approximately zero' - useless. Large deviations theory gives the precise P ~ exp(-n*I(x)).

**Insurance:** Allianz reserves are computed via the Cramer rate function for tail losses at 10^(-9)
**Coding theory:** Shannon's theorem uses Sanov's principle - decoding error rates are determined by KL divergence
**Reinforcement learning:** concentration of the Q-function in Q-learning is controlled by Donsker-Varadhan; PPO uses KL constraints motivated by Sanov
**Finance:** tail VaR estimation for portfolios is inaccessible to Gaussian methods - the Cramer asymptotic is required

Cramer's Theorem and the Rate Function

Allianz keeps reserves for losses exceeding the expected value fivefold - an event with probability 10^(-9). The CLT answers 'approximately zero', which is useless. Harald Cramer solved this in 1938: P(X-bar_n >= x) ~ exp(-n*I(x)), where I is the rate function computed via the Legendre transform.

For the normal distribution I(x) = (x-mu)^2/(2*sigma^2) - the Cramer estimate matches Gaussian tails exactly. In general I is asymmetric: P(X-bar_n - E[X] > eps) and P(X-bar_n - E[X] < -eps) can decay at different rates.

The Cramer rate function I(x) is the Legendre conjugate of which function?

Sanov's Theorem for Empirical Measures

Ivan Sanov in 1957 lifted large deviations from scalar averages to distributions: what is the probability that the empirical measure mu-hat_n over a sample of size n lies closer to nu rather than the true mu? The answer is KL divergence in the exponent. This linked large deviations with Shannon information theory and underpinned the coding theorems.

Sanov's principle proves the second law of thermodynamics probabilistically: the macrostate of maximum entropy is the I-projection of the uniform distribution onto the constraint set; deviations from it are exponentially unlikely with exponent = entropy deficit.

What is the rate function in Sanov's theorem for the empirical measure?

Donsker-Varadhan Theorem for Markov Chains

Monica Donsker and Srinivasa Varadhan, in 1975-1983, built the large deviations principle for Markov chains and stochastic processes - work that earned Varadhan the Abel Prize in 2007. In modern reinforcement learning this underlies concentration estimates in Q-learning: the probability that the empirical state distribution of a chain deviates from the stationary one is controlled by the Donsker-Varadhan functional.

Large deviations connect probability and information

The large deviations principle is a bridge between probability, information theory, and statistical physics.

Information theory — Sanov's principle: the rate function for empirical measures is KL divergence D(nu||mu); the foundation of Shannon coding theorems
Statistical mechanics — Boltzmann's principle S = -k*log P(macrostate) is a special case of the large deviations principle
Reinforcement learning — Concentration of the Q-function in RL is controlled by Donsker-Varadhan; PPO/TRPO use KL constraints motivated by Sanov
Stochastic processes — Donsker-Varadhan extends large deviations to Markov chains, diffusions, and empirical occupation measures

Итоги

**Cramer's theorem:** P(X-bar_n >= x) ~ exp(-n*I(x)), I = Legendre conjugate of the log-MGF
**Sanov's theorem:** P(mu-hat_n ~ nu) ~ exp(-n*D_KL(nu||mu)) - rate function = KL divergence
**Contraction principle:** Cramer follows from Sanov by projecting onto the constraint E_nu[X] = x
**Gartner-Ellis:** extension of Cramer to dependent variables via the limiting log-MGF
**Donsker-Varadhan:** large deviations for Markov chains; functional I(nu) accounts for the transition operator P
**Link to CLT:** CLT gives a Gaussian approximation at 1/sqrt(n); LDP gives the precise exponential asymptotic

What distinguishes the Donsker-Varadhan functional from the KL divergence in Sanov's theorem?