Probability Theory
Central Limit Theorem
Lesson Objectives
- Understand why the CLT is "the most important theorem in statistics"
- See how sums of draws from almost ANY distribution converge to normal
- Master the standardization formula for sums
- Apply the CLT to practical calculations
- Know the limits of the theorem's applicability
Prerequisites
- Normal distribution N(μ, σ²)
- Law of Large Numbers
- Z-score and standardization
**1889.** Francis Galton invents a curious device - the Galton board (quincunx). Balls fall through rows of pegs, bouncing randomly left or right. Each path is unique, chaotic, unpredictable. But the result? **A perfect bell curve**, every single time. Galton called this "the miracle of order from chaos." We call it the Central Limit Theorem.
- Polls: why the margin of error is ±3% with 1,000 respondents
- Human height: the sum of thousands of genetic factors
- Physics: thermal noise as the sum of molecular collisions
- Finance: a portfolio composed of many assets
- ML: why SGD works with mini-batches
A Theorem Three Centuries in the Making
De Moivre (1733) discovered that the binomial distribution approximates a bell curve - his way of avoiding factorial tables. Laplace (1812) generalized the idea. But the **general proof** was given by the Russian mathematician Alexander Lyapunov only in 1901, using characteristic functions. 168 years from discovery to complete understanding!
Central Limit Theorem
A Galton board: a ball bounces left and right through rows of pegs, and a pile of balls forms a **bell-shaped** distribution. This is no accident - it is the **Central Limit Theorem** (CLT) in action. The sum of many independent random influences tends toward a normal distribution.
Formally: for i.i.d. $X_1, X_2, \dots$ with $\mathbb{E}[X]=\mu$ and $\text{Var}(X)=\sigma^2<\infty$, the normalized mean $\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$ converges in distribution to $\mathcal{N}(0,1)$ as $n\to\infty$. What is remarkable about the CLT is its **universality**: whatever the distribution of $X$, as long as its variance is finite, the limit is always normal.
CLT explains why the normal distribution is the 'king' of statistics: human height, measurement error, data noise - all are sums of many independent factors. Confidence intervals, p-values, and A/B tests are all built on CLT.
The Central Limit Theorem states that for large $n$:
$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$
CLT explains the ubiquity of the normal distribution: the sum (or mean) of many independent random terms is asymptotically normal, regardless of their own distribution.
1. The Galton Board - CLT in Action
Consider: a ball falls through N rows of pegs. At each peg it bounces left (−1) or right (+1) with equal probability.
The ball's final position is
$$S_N = X_1 + X_2 + \dots + X_N = \sum_{i=1}^{N} X_i,$$
where $X_i = \pm 1$ with probability 0.5 each. This is the sum of N independent random variables!
- N = 1: only 2 positions (no bell curve yet)
- N = 5: the outline of a bell is already visible
- N = 20: nearly a perfect Gaussian
- N = 100: indistinguishable from normal
**That's the miracle of the CLT:** each ball follows its own chaotic path, but the **aggregate** obeys a strict law - the normal distribution.
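The board is easy to imitate in code. The following is a minimal simulation sketch, not part of the original lesson: the function name `galton_positions`, the seed, and the ball count are illustrative assumptions. It drops balls through 100 rows and checks that the positions have mean near 0 and standard deviation near √100 = 10.

```python
import random
import statistics

def galton_positions(rows, balls, seed=0):
    """Drop `balls` balls through `rows` rows of pegs; return final positions."""
    rng = random.Random(seed)
    # Each peg contributes -1 (left) or +1 (right) with probability 0.5.
    return [sum(rng.choice((-1, 1)) for _ in range(rows)) for _ in range(balls)]

positions = galton_positions(rows=100, balls=5_000)
mean = statistics.fmean(positions)
std = statistics.pstdev(positions)
# Share of balls within one theoretical standard deviation (sqrt(rows) = 10).
within_one_sigma = sum(abs(p) <= 10 for p in positions) / len(positions)
print(f"mean={mean:.2f}, std={std:.2f}, within 1 sigma={within_one_sigma:.2f}")
```

Because positions are discrete, the share within one standard deviation tends to land slightly above the continuous-normal value of about 0.68.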
On a Galton board with 100 rows of pegs, what distribution does the ball's final position follow?
The final position is the sum of 100 independent ±1 values, i.e. S = 2B − 100 with B ~ Binomial(100, 0.5). At n = 100 this shifted binomial is virtually indistinguishable from N(0, 100), a normal with σ = 10. The CLT in its purest form!
2. Formal Statement
Let $X_1, X_2, \ldots, X_n$ be **independent identically distributed** (i.i.d.) random variables with:
- $E[X_i] = \mu$ - mean
- $Var[X_i] = \sigma^2 < \infty$ - finite variance
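**Sum:**
$$S_n = X_1 + X_2 + \dots + X_n$$
**Standardized sum:**
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$
**Central Limit Theorem:**
$$Z_n \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty, \qquad \text{i.e. } P(Z_n \le z) \to \Phi(z) \text{ for every } z$$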
**Equivalent statement for the mean:** $\bar{X}_n = S_n / n$ has a distribution close to $N(\mu, \sigma^2/n)$.
**Standard error of the mean:** $SE = \sigma / \sqrt{n}$
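A short worked derivation may help show where the normalizing constants come from. It takes $S_n = X_1 + \dots + X_n$ as the sum and $Z_n$ as its standardized version, and uses only linearity of expectation and additivity of variance for independent terms:

$$
\begin{aligned}
E[S_n] &= \sum_{i=1}^{n} E[X_i] = n\mu, \\
Var[S_n] &= \sum_{i=1}^{n} Var[X_i] = n\sigma^2 \quad \text{(by independence)}, \\
Z_n &= \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}, \qquad E[Z_n] = 0, \quad Var[Z_n] = 1.
\end{aligned}
$$

Dividing the numerator and denominator by $n$ gives the second form of $Z_n$, which is why the statement for the sum and the statement for the mean are the same theorem.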
X₁, X₂, ..., X₁₀₀ are i.i.d. with μ = 10, σ = 5. What is the approximate distribution of the sum S₁₀₀?
E[S₁₀₀] = 100 × 10 = 1000. Var[S₁₀₀] = 100 × 25 = 2500. By the CLT: S₁₀₀ ≈ N(1000, 2500), i.e. σ = 50.
3. Universality - Why This Is So Surprising
The most remarkable thing about the CLT is its **universality**. The distribution of the individual variables does not matter!
| Distribution of Xᵢ | Shape | Sum for large n |
|---|---|---|
| Bernoulli(0.5) | Discrete, 0-1 | N(0.5n, 0.25n) |
| Uniform(0,1) | Flat | N(0.5n, n/12) |
| Exponential(1) | Right-skewed | N(n, n) |
| Poisson(λ) | Discrete, skewed | N(λn, λn) |
| Any with σ² < ∞ | Any shape | ≈ Normal |
All roads lead to the normal distribution!
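The table can be checked empirically. The sketch below is an illustration under stated assumptions, not the lesson's own method: the helper `standardized_sums`, the choice n = 100, the trial count, and the seed are all hypothetical. It standardizes sums drawn from three of the listed distributions and compares the simulated P(Zₙ ≤ 1) with Φ(1).

```python
import random
from math import sqrt
from statistics import NormalDist

def standardized_sums(draw, mu, sigma, n, trials, seed=0):
    """Simulate Z_n = (S_n - n*mu) / (sigma*sqrt(n)) for `trials` independent sums."""
    rng = random.Random(seed)
    return [
        (sum(draw(rng) for _ in range(n)) - n * mu) / (sigma * sqrt(n))
        for _ in range(trials)
    ]

# (sampler, mean, standard deviation) for three rows of the table
sources = {
    "Bernoulli(0.5)": (lambda r: float(r.random() < 0.5), 0.5, 0.5),
    "Uniform(0,1)": (lambda r: r.random(), 0.5, sqrt(1 / 12)),
    "Exponential(1)": (lambda r: r.expovariate(1.0), 1.0, 1.0),
}

target = NormalDist().cdf(1.0)  # Phi(1), about 0.841
results = {}
for name, (draw, mu, sigma) in sources.items():
    z = standardized_sums(draw, mu, sigma, n=100, trials=4_000)
    results[name] = sum(v <= 1.0 for v in z) / len(z)
    print(f"{name:15s} P(Z_n <= 1) ~ {results[name]:.3f}  (Phi(1) = {target:.3f})")
```

All three estimates should land near Φ(1). The Bernoulli row can sit a little higher because its sum only takes integer values.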
From a Die to a Bell Curve
Sum of n die rolls
One die: μ = 3.5, σ² = 35/12 ≈ 2.92
Sum of n dice ≈ N(3.5n, 2.92n)
| n | E[Sₙ] | σ | Shape |
|---|---|---|---|
| 1 | 3.5 | 1.7 | Uniform |
| 2 | 7 | 2.4 | Triangular |
| 10 | 35 | 5.4 | Almost a bell |
| 100 | 350 | 17 | Perfect bell curve |
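For dice the exact distribution of the sum is cheap to compute, so the quality of the bell-curve fit can be measured rather than eyeballed. This is a sketch under assumptions: the helper names `dice_sum_pmf` and `max_cdf_gap` are hypothetical, and the continuity correction (evaluating the normal CDF at k + 0.5) is an added choice, not something the table specifies.

```python
from math import sqrt
from statistics import NormalDist

def dice_sum_pmf(n):
    """Exact pmf of the sum of n fair six-sided dice, built by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

def max_cdf_gap(n):
    """Largest gap between the exact CDF and the N(3.5n, 35n/12) CDF at k + 0.5."""
    approx = NormalDist(mu=3.5 * n, sigma=sqrt(35 * n / 12))
    pmf, running, gap = dice_sum_pmf(n), 0.0, 0.0
    for total in sorted(pmf):
        running += pmf[total]
        gap = max(gap, abs(running - approx.cdf(total + 0.5)))
    return gap

for n in (1, 2, 10, 100):
    print(n, round(max_cdf_gap(n), 4))
```

The printed gap shrinks as n grows, which is the numerical counterpart of the "Shape" column.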
Server request processing time follows an exponential distribution (heavily right-skewed). The average time of 100 requests is...
By the CLT, the mean of 100 independent values (even exponential ones!) is approximately normal. For large n the shape of the original distribution does not matter, as long as its variance is finite.
4. Practical Applications
Approximating the Binomial
If $X \sim Binomial(n, p)$, then for sufficiently large n:
$$X \approx N\big(np,\; np(1-p)\big)$$
**Rule of thumb:** the CLT approximation works well when $np \geq 5$ and $n(1-p) \geq 5$.
A/B Test
5% conversion rate, 1,000 visitors
Number of conversions: X ~ Binomial(1000, 0.05)
By the CLT: X ≈ N(50, 47.5)
μ = 1000 × 0.05 = 50
σ = √(1000 × 0.05 × 0.95) ≈ 6.9
**P(X ≥ 60) = ?**
z = (60 - 50) / 6.9 ≈ 1.45
P(X ≥ 60) = P(Z ≥ 1.45) ≈ 0.074 = 7.4%
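It can be instructive to compare this normal-approximation answer with the exact binomial tail. The sketch below is an added illustration: the continuity-corrected variant is not part of the worked example, and the variable names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.05
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Normal approximation exactly as in the worked example (no continuity correction).
normal_tail = 1 - NormalDist(mu, sigma).cdf(60)
# Same approximation with a continuity correction: P(X >= 60) ~ P(Y >= 59.5).
corrected_tail = 1 - NormalDist(mu, sigma).cdf(59.5)
# Exact binomial tail for comparison.
exact_tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(60, n + 1))

print(f"normal: {normal_tail:.4f}  corrected: {corrected_tail:.4f}  exact: {exact_tail:.4f}")
```

With p = 0.05 the binomial is still somewhat right-skewed, so the uncorrected approximation tends to understate the upper tail; the continuity correction closes part of that gap.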
Polling Margin of Error
The famous "±3% with 1,000 respondents" comes straight from the CLT!
Public Opinion Poll
Where ±3% comes from
Poll of n = 1,000 people. Assume the true proportion p = 0.5 (worst case).
Sample proportion: p̂ ≈ N(p, p(1-p)/n) = N(0.5, 0.00025)
SE = √(0.5 × 0.5 / 1000) = 0.0158 ≈ 1.6%
95% confidence interval: ±1.96 × SE ≈ ±3.1%
**That's where ±3% comes from!**
To reduce the polling margin of error from ±3% to ±1%, how many respondents are needed?
SE ~ 1/√n. To cut the margin of error by a factor of 3, n must increase by a factor of 9. 1,000 × 9 = 9,000 respondents.
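The margin-of-error arithmetic can be wrapped in two small helpers. This is a sketch assuming a 95% confidence level and the worst case p = 0.5 as defaults; the function names `margin_of_error` and `required_sample_size` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sqrt(p * (1 - p) / n)

def required_sample_size(margin, p=0.5, confidence=0.95):
    """Smallest n whose margin of error does not exceed `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)

print(f"{margin_of_error(1000):.4f}")   # roughly 0.031
print(f"{margin_of_error(9000):.4f}")   # one third of the value above
print(required_sample_size(0.01))
```

Multiplying n by 9 divides the margin by exactly 3. Because 1,000 respondents give about ±3.1% rather than ±3.0%, hitting ±1.0% on the nose takes somewhat more than 9,000 respondents under these assumptions.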
5. When the CLT Does NOT Apply
The CLT is not magic. Conditions apply!
**Myth:** The CLT works for any distribution.
**Reality:** The CLT requires finite variance, σ² < ∞.
For heavy-tailed distributions (Cauchy, Pareto with α ≤ 2) the variance is infinite, so the classical CLT does not apply. For tail index α < 2, suitably normalized sums converge to a non-normal Lévy α-stable distribution instead.
- **Cauchy distribution:** even the mean is undefined! The sum of n Cauchy variables is still Cauchy
- **Very small n:** for skewed distributions the approximation is poor when n < 30
- **p close to 0 or 1:** for Binomial(n, p) it is better to use the Poisson approximation
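The Cauchy case is worth seeing numerically. The sketch below is an illustration with hypothetical helper names, a fixed seed, and an arbitrary trial count. It measures the spread of the sample mean by its interquartile range (IQR), since the variance does not exist for Cauchy draws.

```python
import random
from math import pi, tan
from statistics import quantiles

def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return tan(pi * (rng.random() - 0.5))

def iqr_of_sample_means(draw, n, trials=2_000, seed=0):
    """IQR of the sample mean of n draws, estimated over `trials` repetitions."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1

for n in (1, 10, 100):
    cauchy_iqr = iqr_of_sample_means(cauchy, n)
    normal_iqr = iqr_of_sample_means(lambda r: r.gauss(0, 1), n)
    print(f"n={n:3d}  Cauchy IQR={cauchy_iqr:.2f}  Normal IQR={normal_iqr:.2f}")
```

For normal draws the IQR of the mean shrinks like 1/√n. For Cauchy draws it stays near 2, the IQR of a single standard Cauchy variable, so averaging buys no precision at all.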
Can a normal approximation be used for Binomial(20, 0.02)?
Rule: np ≥ 5 and n(1-p) ≥ 5. Here np = 20 × 0.02 = 0.4 < 5. The approximation will be poor. Better to use Poisson(0.4) or the exact formula.
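A direct comparison makes the failure concrete. The following sketch uses hypothetical helper names and a continuity-corrected normal mass per integer (an added choice for illustration). It tabulates the exact Binomial(20, 0.02) probabilities next to the Poisson(0.4) and normal approximations.

```python
from math import comb, exp, factorial, sqrt
from statistics import NormalDist

n, p = 20, 0.02
lam = n * p  # 0.4

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

def normal_pmf(k):
    # Continuity-corrected normal mass assigned to the integer k.
    return normal.cdf(k + 0.5) - normal.cdf(k - 0.5)

print("k  exact   Poisson  normal")
for k in range(4):
    print(k, round(binom_pmf(k), 4), round(poisson_pmf(k), 4), round(normal_pmf(k), 4))
```

The Poisson column tracks the exact values closely, while the normal column badly misjudges P(X = 0) and also places noticeable probability on impossible negative counts.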
Practice
An elevator holds 1,000 kg. 15 office workers need to ride it; average weight 70 kg, σ = 15 kg. What is the probability their combined weight exceeds the limit?
S₁₅ ≈ N(1050, 3375), σ ≈ 58 kg
z = (1000 - 1050) / 58 ≈ -0.86
P(S > 1000) = P(Z > -0.86) = 1 - P(Z < -0.86) ≈ 1 - 0.195 = 0.805
**About an 80% probability!** The elevator is too weak for 15 people.
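The same calculation can be scripted, which also makes it easy to ask a follow-up question. In this sketch the helper `overload_probability` and the 1% risk threshold are illustrative assumptions, not part of the exercise; with only a dozen or so riders the normal approximation is itself rough.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, limit = 70, 15, 1000  # kg per person (mean, std) and elevator capacity

def overload_probability(n):
    """Normal-approximation probability that n riders together exceed the limit."""
    total = NormalDist(mu=n * mu, sigma=sigma * sqrt(n))
    return 1 - total.cdf(limit)

p_over = overload_probability(15)
print(f"P(overload with 15 riders) = {p_over:.3f}")

# Hypothetical follow-up: largest group whose overload risk stays below 1%.
safe_n = max(k for k in range(1, 16) if overload_probability(k) < 0.01)
print(f"largest group with risk below 1%: {safe_n}")
```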
A coin is tossed 400 times. Find the probability of getting between 185 and 215 heads.
X ~ Bin(400, 0.5)
μ = 200, σ = √(400 × 0.25) = 10
With continuity correction: P(184.5 < X < 215.5)
z₁ = (184.5 - 200) / 10 = -1.55
z₂ = (215.5 - 200) / 10 = 1.55
P = Φ(1.55) - Φ(-1.55) = 0.939 - 0.061 = **0.878 ≈ 88%**
An insurance company holds 10,000 policies. The probability of a claim on each policy is 1%; the payout per claim is $50,000. Total premiums must cover total payouts in 99% of scenarios. What is the minimum premium per policy?
Number of claims: X ~ Bin(10000, 0.01) ≈ N(100, 99), σ_X ≈ 9.95
99th percentile: X₀.₉₉ = 100 + 2.33 × 9.95 ≈ 123.2
Max payout in 99% of scenarios: 123.2 × $50,000 = $6.16M
Min premium per policy: $6.16M / 10,000 = **$616**
(A $500 premium covers only the expected payout, E[payout] = 100 × $50,000 = $5M, so it would fall short in roughly half of all scenarios!)
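The percentile step generalizes to any coverage level. The sketch below assumes the normal approximation to the claim count is adequate; the function name `min_premium` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

policies, claim_prob, payout = 10_000, 0.01, 50_000

# Normal approximation to the number of claims: N(np, np(1-p)).
claims = NormalDist(
    mu=policies * claim_prob,
    sigma=sqrt(policies * claim_prob * (1 - claim_prob)),
)

def min_premium(coverage):
    """Premium per policy so that premiums cover payouts with probability `coverage`."""
    worst_case_claims = claims.inv_cdf(coverage)
    return worst_case_claims * payout / policies

print(round(min_premium(0.99)))  # about 616
print(round(min_premium(0.50)))  # break-even premium: covers only half the scenarios
```

Raising the coverage level raises the premium only through the z-multiplier on σ, which is why the safety loading here is about $116 on top of the $500 expected cost.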
A fair coin is tossed 400 times. By the CLT, the number of heads $X \approx N(200, 100)$. What is the probability $185 \leq X \leq 215$ (with continuity correction)?
$\mu = 200$, $\sigma = \sqrt{400\cdot 0.25} = 10$. With continuity correction $z = \pm 15.5/10 = \pm 1.55$, so $\Phi(1.55) - \Phi(-1.55) \approx 0.878$. The CLT turns the binomial into a normal at large $n$.
The CLT - The Crown of Probability Theory
This theorem unifies everything studied so far and opens the door to statistics.
- Confidence Intervals: built directly on the CLT
- Hypothesis Testing: Z-tests and t-tests rely on normality
- Regression: OLS estimators are asymptotically normal by the CLT
- Bayesian Statistics: the posterior distribution is often approximately normal
- Machine Learning: SGD, BatchNorm, weight initialization - the CLT is everywhere
Summary
- **CLT:** the standardized sum of n i.i.d. variables → standard normal distribution as n → ∞
- **Universality:** holds for ANY distribution with finite variance
- **Formula:** $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} \to N(0, 1)$
- **Standard error:** $SE = \sigma/\sqrt{n}$ - decreases slowly!
- **In practice:** Bin(n, p) ≈ N(np, npq) when np ≥ 5 and nq ≥ 5
- **Limitations:** does not apply when variance is infinite (e.g., Cauchy)
Questions for Reflection
- Return to the Galton board: how does it visually demonstrate "order from chaos"?
- How does the CLT explain why height, IQ, and measurement errors are approximately normally distributed?
- A sociologist wants to reduce a poll's margin of error from 3% to 1%. By how much will the cost of the survey increase?
- What do the CLT and mini-batch SGD in neural networks have in common?