Probability Theory

Modes of Convergence

The sample mean $\bar X_n$ converges to $\mu$ - but in what sense? Almost sure convergence, convergence in probability, in $L^2$, and in distribution are four different concepts with different implications. Choosing the wrong one leads to incorrect statistical conclusions.

SGD converges to a critical point 'almost surely' under decreasing step sizes - in the strict sense of a.s. convergence. This is stronger than convergence in probability: it says that a single trajectory of the algorithm converges, not just that most trajectories do.
The CLT asserts convergence in distribution: $(\bar X_n - \mu)/(\sigma/\sqrt{n}) \xrightarrow{d} \mathcal{N}(0,1)$. This underlies confidence intervals - but does not mean $\bar X_n$ itself converges to a normal random variable.
Bootstrap estimates the variance of statistics through convergence in distribution: the bootstrap distribution of $\hat\theta^* - \hat\theta$ approximates that of $\hat\theta - \theta$. The justification is precisely convergence in distribution.

Цели урока

Distinguish four types of convergence: a.s., in probability, in $L^p$, and in distribution
Construct counterexamples showing the non-equivalence of these convergence types
Apply the CLT and the continuous mapping theorem to analyze statistics

Предварительные знания

Random variables and their distributions
Expectation and variance
Lebesgue integral and measurable functions

Hierarchy of convergence types

Let $X_n, X$ be random variables on the same probability space. Almost sure convergence: $P(\lim_n X_n = X) = 1$. Convergence in probability: $P(|X_n - X| > \varepsilon) \to 0$ for every $\varepsilon > 0$. $L^p$ convergence: $\mathbb{E}[|X_n - X|^p] \to 0$. Convergence in distribution: $F_{X_n}(x) \to F_X(x)$ at continuity points of $F_X$. Implications: a.s. and $L^p$ each imply in probability; in probability implies in distribution. None of the reverses hold in general.

The Portmanteau theorem on convergence in distribution: $X_n \xrightarrow{d} X$ if and only if $\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]$ for all bounded continuous $f$. This definition does not require a common probability space and generalizes to metric spaces - the foundation of weak convergence of measures.

Convergence in distribution does not imply convergence of expectations. Classic example: $X_n = n \cdot \mathbf{1}_{[0,1/n]}$ on $[0,1]$. Then $X_n \xrightarrow{d} 0$, but $\mathbb{E}[X_n] = 1 \not\to 0$. Uniform integrability is the sufficient condition for passing to the limit under the expectation.

The central limit theorem and the delta method

Lindeberg-Levy CLT: if $X_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$, then $\sqrt{n}(\bar X_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$. Berry-Esseen theorem: $\sup_x |F_n(x) - \Phi(x)| \leq C \mathbb{E}[|X|^3]/(\sigma^3 \sqrt{n})$. Delta method: if $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\mu$, then $\sqrt{n}(g(\bar X_n) - g(\mu)) \xrightarrow{d} \mathcal{N}(0, [g'(\mu)]^2 \sigma^2)$.

Convergence in Probability and the Law of Large Numbers

A neural network trained on ImageNet for 1 million gradient steps does not simply decrease its loss monotonically. The stochastic gradient is a random variable, and the word convergence requires a precise meaning. When a Monte Carlo estimator uses 10 million samples to estimate an integral, the result converges in probability: the probability of a large error shrinks to zero as the sample count grows, even if individual estimates can occasionally be wrong.

In ML, convergence in probability underpins PAC (probably approximately correct) learning. A PAC learner outputs a hypothesis h such that with probability at least 1-delta, the true risk of h differs from the empirical risk by at most epsilon. Both delta and epsilon appear directly in the definition of convergence in probability.

Let X_1, X_2, ... be iid Bernoulli(p=0.3). How large must n be so that P(|Xbar_n - 0.3| > 0.01) <= 0.05, using Chebyshev?

The answer follows directly from the definition and properties of the object under consideration.

Almost Sure Convergence and the Strong LLN

Convergence in probability guarantees that the error is small with high probability at each fixed n, but it allows infinitely many occasional failures. Almost sure (a.s.) convergence is strictly stronger: the event that X_n does NOT converge to X has probability zero. When SGD with decaying learning rate converges almost surely, it means the optimizer reaches the limit on virtually every random seed, with probability one over the randomness of data shuffling.

The distinction between a.s. and in-probability convergence matters for SGD theory. Online learning algorithms typically guarantee convergence in probability (PAC bounds), while the almost sure convergence of SGD with decaying step sizes is proven using the Robbins-Monro theorem. Both guarantee optimization success, but a.s. convergence provides stronger trajectory-level guarantees.

Construct a sequence X_n that converges in probability to 0 but not almost surely.

The answer follows directly from the definition and properties of the object under consideration.

Central Limit Theorem and Convergence in Distribution

Mini-batch SGD computes gradients on batches of 256 samples. The batch gradient is a sum of 256 independent random vectors. The Central Limit Theorem (CLT) guarantees that this sum is approximately Gaussian regardless of the distribution of individual sample gradients, provided their variance is finite. This is why gradient noise in SGD can be modeled as Gaussian, enabling analysis tools from stochastic calculus.

The delta method is used extensively in ML for constructing confidence intervals around nonlinear functions of model parameters. In neural network calibration research, CLT-based intervals are computed for softmax outputs to assess uncertainty. SGD noise analysis treats the stochastic gradient as a CLT-regime Gaussian perturbation, enabling stability and convergence proofs via Lyapunov functions from stochastic analysis.

Let X_i ~ Bernoulli(0.4), n = 400. Using the CLT, find an approximate 95% confidence interval for the true mean p.

The answer follows directly from the definition and properties of the object under consideration.

Counterexample: in probability without a.s.

On $[0,1]$ with Lebesgue measure, define $X_n = \mathbf{1}_{[k/2^j, (k+1)/2^j]}$ where $n = 2^j + k$, $0 \leq k < 2^j$ (the 'running wave'). Then $P(|X_n| > \varepsilon) = 2^{-j} \to 0$ (convergence in probability), but for every fixed $x \in [0,1]$ the sequence $X_n(x)$ oscillates between 0 and 1 infinitely often. So there is no almost sure convergence.

Итоги

Four convergence types form a hierarchy: a.s.
and $L^p$ are stronger than in probability, which is stronger than in distribution.
The CLT asserts convergence in distribution of the normalized mean.
The delta method extends the CLT to smooth functions of statistics.

Connections to other topics

Concentration of measure (prob-22) strengthens the CLT: instead of distributional convergence you get exponential tail bounds. PAC learning (prob-26) uses convergence in probability to prove generalization. Large deviations theory studies the rate of decay of $P(|\bar X_n - \mu| > \varepsilon)$ exponentially in $n$.

Prob 22 — related
Prob 26 — related

Вопросы для размышления

The delta method gives $\sqrt{n}(\log \bar X_n - \log \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2/\mu^2)$ for $\mu > 0$. How can you use this to build a confidence interval for $\log \mu$ (and then $\mu$) with better coverage than the direct CI for $\mu$?

Связанные уроки

Probability Theory

Modes of Convergence

SGD converges to a critical point 'almost surely' under decreasing step sizes - in the strict sense of a.s. convergence. This is stronger than convergence in probability: it says that a single trajectory of the algorithm converges, not just that most trajectories do.
The CLT asserts convergence in distribution: $(\bar X_n - \mu)/(\sigma/\sqrt{n}) \xrightarrow{d} \mathcal{N}(0,1)$. This underlies confidence intervals - but does not mean $\bar X_n$ itself converges to a normal random variable.
Bootstrap estimates the variance of statistics through convergence in distribution: the bootstrap distribution of $\hat\theta^* - \hat\theta$ approximates that of $\hat\theta - \theta$. The justification is precisely convergence in distribution.

Цели урока

Distinguish four types of convergence: a.s., in probability, in $L^p$, and in distribution
Construct counterexamples showing the non-equivalence of these convergence types
Apply the CLT and the continuous mapping theorem to analyze statistics

Предварительные знания

Random variables and their distributions
Expectation and variance
Lebesgue integral and measurable functions

Hierarchy of convergence types

The central limit theorem and the delta method

Convergence in Probability and the Law of Large Numbers

Let X_1, X_2, ... be iid Bernoulli(p=0.3). How large must n be so that P(|Xbar_n - 0.3| > 0.01) <= 0.05, using Chebyshev?

The answer follows directly from the definition and properties of the object under consideration.

Almost Sure Convergence and the Strong LLN

Construct a sequence X_n that converges in probability to 0 but not almost surely.

The answer follows directly from the definition and properties of the object under consideration.

Central Limit Theorem and Convergence in Distribution

Let X_i ~ Bernoulli(0.4), n = 400. Using the CLT, find an approximate 95% confidence interval for the true mean p.

The answer follows directly from the definition and properties of the object under consideration.

Counterexample: in probability without a.s.

Итоги

Four convergence types form a hierarchy: a.s.
and $L^p$ are stronger than in probability, which is stronger than in distribution.
The CLT asserts convergence in distribution of the normalized mean.
The delta method extends the CLT to smooth functions of statistics.

Connections to other topics

Prob 22 — related
Prob 26 — related

Вопросы для размышления

The delta method gives $\sqrt{n}(\log \bar X_n - \log \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2/\mu^2)$ for $\mu > 0$. How can you use this to build a confidence interval for $\log \mu$ (and then $\mu$) with better coverage than the direct CI for $\mu$?