Probability Theory

Central Limit Theorem

Lesson Objectives

Understand why the CLT is "the most important theorem in statistics"
See how ANY distribution converges to normal
Master the standardization formula for sums
Apply the CLT to practical calculations
Know the limits of the theorem's applicability

Prerequisites

Normal distribution N(μ, σ²)
Law of Large Numbers
Z-score and standardization

**1889.** In his book *Natural Inheritance*, Francis Galton describes a curious device: the Galton board (quincunx). Balls fall through rows of pegs, bouncing randomly left or right. Each path is unique, chaotic, unpredictable. But the result? **A bell-shaped pile**, every single time. Galton marveled at this order emerging from apparent chaos. We call it the Central Limit Theorem.

Polls: why the margin of error is ±3% with 1,000 respondents
Human height: the sum of thousands of genetic factors
Physics: thermal noise as the sum of molecular collisions
Finance: a portfolio composed of many assets
ML: why SGD works with mini-batches

A Theorem Three Centuries in the Making

De Moivre (1733) discovered that the binomial distribution is well approximated by a bell curve, which was his way of avoiding factorial tables. Laplace (1812) generalized the idea. But the **general proof** was given by the Russian mathematician Alexander Lyapunov only in 1901, using characteristic functions. 168 years from discovery to complete understanding!

Central Limit Theorem

A Galton board: a ball bounces left and right through rows of pegs, and a pile of balls forms a **bell-shaped** distribution. This is no accident - it is the **Central Limit Theorem** (CLT) in action. The sum of many independent random influences tends toward a normal distribution.

Formally: for i.i.d. $X_1, X_2, \dots$ with $\mathbb{E}[X]=\mu$ and $\text{Var}(X)=\sigma^2<\infty$, the standardized mean $\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$ converges in distribution to $\mathcal{N}(0,1)$ as $n\to\infty$. What is remarkable about the CLT is its **universality**: whatever the distribution of $X$, as long as its variance is finite, the limit is always normal.

The CLT explains why the normal distribution is the 'king' of statistics: human height, measurement error, and data noise can all be viewed as sums of many independent factors. Confidence intervals, p-values, and A/B tests all build on the CLT.

The Central Limit Theorem states that for large $n$:

$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$

CLT explains the ubiquity of the normal distribution: the sum (or mean) of many independent random terms is asymptotically normal, regardless of their own distribution.

1. The Galton Board - CLT in Action

Consider: a ball falls through N rows of pegs. At each peg it bounces left (−1) or right (+1) with equal probability.

The ball's final position:

$$S_N = X_1 + X_2 + \dots + X_N = \sum_{i=1}^{N} X_i$$

where $X_i = \pm 1$ with probability 0.5 each. This is the sum of N independent random variables!

N = 1: only 2 positions (no bell curve yet)
N = 5: the outline of a bell is already visible
N = 20: nearly a perfect Gaussian
N = 100: indistinguishable from normal

**That's the miracle of the CLT:** each ball follows its own chaotic path, but the **aggregate** obeys a strict law - the normal distribution.

On a Galton board with 100 rows of pegs, what distribution does the ball's final position follow?

The final position is the sum of 100 independent ±1 values, each with mean 0 and variance 1. It is a shifted and rescaled binomial: S = 2B − 100 with B ~ Binomial(100, 0.5). At n = 100 this is virtually indistinguishable from N(0, 100), i.e. σ = 10. The CLT in its purest form!
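The board is easy to simulate. The sketch below is illustrative only: the function name `galton_positions`, the number of balls, and the seed are assumptions, not part of the lesson. It drops 5,000 balls through 100 rows and compares the share landing within 10 positions of the centre with the normal prediction for mean 0 and standard deviation 10.

```python
import random
from statistics import NormalDist, fmean, pstdev

def galton_positions(rows: int, balls: int, seed: int = 0) -> list[int]:
    """Final position of each ball: the sum of `rows` independent +/-1 steps."""
    rng = random.Random(seed)
    return [sum(rng.choice((-1, 1)) for _ in range(rows)) for _ in range(balls)]

positions = galton_positions(rows=100, balls=5_000)
mean = fmean(positions)
std = pstdev(positions)

# Share of balls with |S| <= 10. With 100 rows every position is an even
# integer, so the continuity-corrected normal interval is (-11, 11).
within = sum(abs(p) <= 10 for p in positions) / len(positions)
normal = NormalDist(mu=0, sigma=10)
predicted = normal.cdf(11) - normal.cdf(-11)

print(f"mean={mean:.2f}, std={std:.2f}")
print(f"empirical P(|S|<=10)={within:.3f}, normal approx={predicted:.3f}")
```

The sample mean should be close to 0 and the sample standard deviation close to 10, with the two probabilities agreeing to within simulation noise.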

2. Formal Statement

Let $X_1, X_2, \ldots, X_n$ be **independent identically distributed** (i.i.d.) random variables with:

$E[X_i] = \mu$ - mean
$Var[X_i] = \sigma^2 < \infty$ - finite variance

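**Sum:**

$$S_n = X_1 + X_2 + \dots + X_n$$

**Standardized sum:**

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

**Central Limit Theorem:**

$$Z_n \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty, \qquad \text{i.e. } P(Z_n \le z) \to \Phi(z) \text{ for every } z$$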

**Equivalent statement for the mean:** $\bar{X}_n = S_n / n$ has a distribution close to $N(\mu, \sigma^2/n)$.

**Standard error of the mean:** $SE = \sigma / \sqrt{n}$

X₁, X₂, ..., X₁₀₀ are i.i.d. with μ = 10, σ = 5. What is the approximate distribution of the sum S₁₀₀?

E[S₁₀₀] = 100 × 10 = 1000. Var[S₁₀₀] = 100 × 25 = 2500. By the CLT: S₁₀₀ ≈ N(1000, 2500), i.e. σ = 50.
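A quick simulation can sanity-check these two numbers. The exercise does not say which distribution the Xᵢ follow, so the sketch below assumes a gamma distribution with shape 4 and scale 2.5, chosen only because it has μ = 10 and σ = 5. The sample size and seed are arbitrary.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)

def draw_sum(n: int) -> float:
    # Assumed distribution: gamma with shape 4 and scale 2.5,
    # so mean = 4 * 2.5 = 10 and variance = 4 * 2.5**2 = 25.
    return sum(rng.gammavariate(4.0, 2.5) for _ in range(n))

sums = [draw_sum(100) for _ in range(4_000)]
sum_mean = fmean(sums)
sum_std = pstdev(sums)

print(f"mean of S_100 ~ {sum_mean:.1f} (theory: 1000)")
print(f"std of S_100  ~ {sum_std:.1f} (theory: 50)")
```

Any other distribution with the same mean and a finite variance of 25 should give the same two values up to simulation noise.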

3. Universality - Why This Is So Surprising

The most remarkable thing about the CLT is its **universality**. As long as the variance is finite, the distribution of the individual variables does not matter!

Distribution of Xᵢ	Shape	Sum for large n
Bernoulli(0.5)	Discrete, 0-1	N(0.5n, 0.25n)
Uniform(0,1)	Flat	N(0.5n, n/12)
Exponential(1)	Right-skewed	N(n, n)
Poisson(λ)	Discrete, skewed	N(λn, λn)
Any with σ² < ∞	Any shape	≈ Normal

All roads lead to the normal distribution!
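One way to watch a skewed distribution "forget" its shape is to track the skewness of the sum. For Exponential(1) terms the skewness of the sum of n values is 2/√n, so it should shrink toward 0, the value for a normal distribution. The following is a minimal sketch; the helper names, replication count, and seed are illustrative assumptions.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(2)

def skewness(xs: list[float]) -> float:
    """Sample skewness: the mean of the cubed standardized values."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])

def sums_of_exponentials(n: int, reps: int = 5_000) -> list[float]:
    """`reps` independent sums of n Exponential(1) draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

skew_by_n = {n: skewness(sums_of_exponentials(n)) for n in (1, 10, 100)}
for n, g in skew_by_n.items():
    print(f"n={n:>3}: sample skewness={g:.2f}, theory={2 / n ** 0.5:.2f}")
```

The same experiment can be repeated for the other rows of the table by swapping the generator inside `sums_of_exponentials`.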

From a Die to a Bell Curve

Sum of n die rolls

One die: μ = 3.5, σ² = 35/12 ≈ 2.92.
Sum of n dice ≈ N(3.5n, 2.92n).

n	E[Sₙ]	σ	Shape
1	3.5	1.7	Uniform
2	7	2.4	Triangular
10	35	5.4	Almost a bell
100	350	17	Perfect bell curve
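The mean and σ columns can be checked exactly rather than by simulation. The sketch below builds the probability mass function of the sum by repeated convolution with exact fractions. The function names are illustrative, and n = 100 is left out of the loop only to keep the run fast.

```python
from fractions import Fraction

def dice_sum_pmf(n: int) -> dict[int, Fraction]:
    """Exact PMF of the sum of n fair six-sided dice, by repeated convolution."""
    pmf = {0: Fraction(1)}
    for _ in range(n):
        nxt: dict[int, Fraction] = {}
        for total, prob in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + prob / 6
        pmf = nxt
    return pmf

def mean_and_variance(pmf: dict[int, Fraction]) -> tuple[Fraction, Fraction]:
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var

for n in (1, 2, 10):
    mean, var = mean_and_variance(dice_sum_pmf(n))
    print(f"n={n}: mean={float(mean)}, sigma={float(var) ** 0.5:.2f}")
```

Printing `dice_sum_pmf(2)` shows the triangular shape directly: the probabilities rise from 1/36 at a total of 2 to 6/36 at 7 and fall back symmetrically.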

Server request processing time follows an exponential distribution (heavily right-skewed). The average time of 100 requests is...

By the CLT, the mean of 100 independent values (even exponential ones!) is approximately normally distributed. For large n, the shape of the original distribution matters very little.

4. Practical Applications

Approximating the Binomial

If $X \sim Binomial(n, p)$, then for sufficiently large n:

$$X \approx N\big(np,\; np(1-p)\big)$$

**Rule of thumb:** the CLT approximation works well when $np \geq 5$ and $n(1-p) \geq 5$.

A/B Test

5% conversion rate, 1,000 visitors

Number of conversions: X ~ Binomial(1000, 0.05).
By the CLT: X ≈ N(50, 47.5), with μ = 1000 × 0.05 = 50 and σ = √(1000 × 0.05 × 0.95) ≈ 6.9.
**P(X ≥ 60) = ?**
z = (60 − 50) / 6.9 ≈ 1.45
P(X ≥ 60) ≈ P(Z ≥ 1.45) ≈ 0.074 = 7.4%
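The approximation can be compared with the exact binomial tail. The sketch below computes P(X ≥ 60) three ways: exactly, with the plain normal approximation used in the calculation above, and with a continuity correction (evaluating the normal at 59.5), which is an addition not used in that calculation. Variable names are illustrative.

```python
from math import comb
from statistics import NormalDist

n, p = 1000, 0.05
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5

def binom_pmf(k: int) -> float:
    """Exact Binomial(n, p) probability of exactly k conversions."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

exact = 1.0 - sum(binom_pmf(k) for k in range(60))   # P(X >= 60)

z = NormalDist()
approx_plain = 1.0 - z.cdf((60 - mu) / sigma)         # no continuity correction
approx_cc = 1.0 - z.cdf((59.5 - mu) / sigma)          # with continuity correction

print(f"exact={exact:.4f}, plain={approx_plain:.4f}, corrected={approx_cc:.4f}")
```

For a discrete count like this, the continuity-corrected value usually lands closer to the exact tail than the plain one.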

Polling Margin of Error

The famous "±3% with 1,000 respondents" comes straight from the CLT!

Public Opinion Poll

Where ±3% comes from

Poll of n = 1,000 people. Assume the true proportion is p = 0.5 (the worst case, since p(1 − p) is largest there).
Sample proportion: p̂ ≈ N(p, p(1 − p)/n) = N(0.5, 0.00025).
SE = √(0.5 × 0.5 / 1000) ≈ 0.0158 ≈ 1.6%.
95% confidence interval: ±1.96 × SE ≈ ±3.1%.
**That's where ±3% comes from!**

To reduce the polling margin of error from ±3% to ±1%, how many respondents are needed?

SE ~ 1/√n. To cut the margin of error by a factor of 3, n must increase by a factor of 9. 1,000 × 9 = 9,000 respondents.
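The same arithmetic can be wrapped in two small helpers. This is a sketch under the worst-case assumption p = 0.5 and a 95% confidence level; the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def margin_of_error(n: int, p: float = 0.5, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)

def required_sample_size(margin: float, p: float = 0.5, confidence: float = 0.95) -> int:
    """Smallest n whose margin of error does not exceed `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)

moe_1000 = margin_of_error(1000)
n_for_3pct = required_sample_size(0.03)
n_for_1pct = required_sample_size(0.01)

print(f"margin at n=1000: {moe_1000:.4f}")
print(f"n for ±3%: {n_for_3pct}, n for ±1%: {n_for_1pct}")
```

The ratio of the two sample sizes is about 9, matching the 1/√n rule. The absolute figure for ±1% comes out somewhat above 9,000 because 1,000 respondents give ±3.1% rather than exactly ±3%.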

5. When the CLT Does NOT Apply

The CLT is not magic. Conditions apply!

Myth: the CLT works for any distribution.

Fact: the CLT requires finite variance, σ² < ∞.

For heavy-tailed distributions (Cauchy, Pareto with α ≤ 2) the variance is infinite, so the classical CLT does not apply. For tails with α < 2, suitably normalized sums do NOT converge to a normal law; they converge to a non-Gaussian Lévy stable distribution instead.

**Cauchy distribution:** even the mean is undefined! The sum of n Cauchy variables is still Cauchy
**Very small n:** for skewed distributions the approximation is poor when n < 30
**p close to 0 or 1:** for Binomial(n, p) it is better to use the Poisson approximation
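The Cauchy case is easy to see numerically. The sketch below measures the spread (interquartile range) of sample means for growing n: for normal data it shrinks like 1/√n, while for Cauchy data it should stay near 2, the interquartile range of a single standard Cauchy draw. Helper names, replication count, and seed are illustrative assumptions.

```python
import random
from math import pi, tan
from statistics import fmean, quantiles

rng = random.Random(3)

def cauchy() -> float:
    # Inverse-CDF sampling of a standard Cauchy variable.
    return tan(pi * (rng.random() - 0.5))

def gauss() -> float:
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n: int, reps: int = 1_000) -> float:
    """Interquartile range of `reps` sample means, each over n draws."""
    means = [fmean([draw() for _ in range(n)]) for _ in range(reps)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1

results = {n: (iqr_of_sample_means(cauchy, n), iqr_of_sample_means(gauss, n))
           for n in (1, 10, 500)}
for n, (c, g) in results.items():
    print(f"n={n:>3}: IQR of Cauchy means={c:.2f}, IQR of normal means={g:.3f}")
```

Averaging more Cauchy observations does not make the mean any more stable, which is exactly what the failure of the CLT looks like in practice.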

Can a normal approximation be used for Binomial(20, 0.02)?

Rule: np ≥ 5 and n(1-p) ≥ 5. Here np = 20 × 0.02 = 0.4 < 5. The approximation will be poor. Better to use Poisson(0.4) or the exact formula.
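To see how poor the normal approximation is here, the sketch below compares the exact probabilities of 0, 1, and 2 successes with the Poisson(0.4) values and with a continuity-corrected normal approximation. Variable names are illustrative.

```python
from math import comb, exp, factorial
from statistics import NormalDist

n, p = 20, 0.02
lam = n * p  # Poisson rate, 0.4

exact = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3)]
poisson = [exp(-lam) * lam**k / factorial(k) for k in range(3)]

# Normal approximation with continuity correction: P(k - 0.5 < Y < k + 0.5).
normal = NormalDist(mu=n * p, sigma=(n * p * (1 - p)) ** 0.5)
normal_pmf = [normal.cdf(k + 0.5) - normal.cdf(k - 0.5) for k in range(3)]

for k in range(3):
    print(f"k={k}: exact={exact[k]:.4f}, poisson={poisson[k]:.4f}, normal={normal_pmf[k]:.4f}")
```

The Poisson column tracks the exact values closely, while the normal column misses badly at k = 0, where most of the probability sits.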

Practice

An elevator holds 1,000 kg. 15 office workers need to ride it; average weight 70 kg, σ = 15 kg. What is the probability their combined weight exceeds the limit?

S₁₅ ≈ N(1050, 3375), since E[S₁₅] = 15 × 70 = 1050 and Var[S₁₅] = 15 × 15² = 3375, so σ ≈ 58 kg.
z = (1000 − 1050) / 58 ≈ −0.86
P(S > 1000) = P(Z > −0.86) = 1 − P(Z < −0.86) ≈ 1 − 0.195 = 0.805
**About an 80% probability!** The elevator cannot safely carry all 15 people.
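The calculation is short enough to script. The sketch below reproduces the overload probability and then, as an extension that is not part of the exercise, searches for the largest group whose overload risk stays under an assumed 1% threshold. It treats the weights as independent and relies on the normal approximation even for small groups, which is itself an assumption.

```python
from statistics import NormalDist

MEAN_KG, SD_KG, LIMIT_KG = 70.0, 15.0, 1000.0

def overload_probability(passengers: int) -> float:
    """P(total weight > limit) under the normal approximation for the sum."""
    total = NormalDist(mu=passengers * MEAN_KG, sigma=SD_KG * passengers ** 0.5)
    return 1.0 - total.cdf(LIMIT_KG)

def max_passengers(risk: float = 0.01) -> int:
    """Largest group size whose overload probability does not exceed `risk`."""
    n = 1
    while overload_probability(n + 1) <= risk:
        n += 1
    return n

p_overload = overload_probability(15)
safe_group = max_passengers()
print(f"P(overload with 15 people) = {p_overload:.3f}")
print(f"largest group with risk <= 1%: {safe_group}")
```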

A coin is tossed 400 times. Find the probability of getting between 185 and 215 heads.

X ~ Bin(400, 0.5), so μ = 200 and σ = √(400 × 0.25) = 10.
With continuity correction: P(185 ≤ X ≤ 215) ≈ P(184.5 < Y < 215.5), where Y ~ N(200, 100).
z₁ = (184.5 − 200) / 10 = −1.55
z₂ = (215.5 − 200) / 10 = 1.55
P ≈ Φ(1.55) − Φ(−1.55) = 0.939 − 0.061 = **0.878 ≈ 88%**
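Because n is modest, the exact binomial probability can be computed directly and set against the continuity-corrected normal value. A minimal check, with illustrative variable names:

```python
from math import comb
from statistics import NormalDist

n = 400

# Exact: count the outcomes with 185..215 heads out of 2**400 equally likely ones.
exact = sum(comb(n, k) for k in range(185, 216)) / 2**n

# Normal approximation with continuity correction.
normal = NormalDist(mu=200, sigma=10)
approx = normal.cdf(215.5) - normal.cdf(184.5)

print(f"exact={exact:.4f}, normal approx={approx:.4f}")
```

For a symmetric binomial of this size the two values should agree to roughly three decimal places.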

An insurance company holds 10,000 policies. The probability of a claim is 1%; payout is $50,000. The average premium must cover 99% of scenarios. What is the minimum premium?

Number of claims: X ~ Bin(10000, 0.01) ≈ N(100, 99), so σ_X ≈ 9.95.
99th percentile: x₀.₉₉ = 100 + 2.33 × 9.95 ≈ 123.2 claims.
Maximum payout in 99% of scenarios: 123.2 × $50,000 ≈ $6.16M.
Minimum premium per policy: $6.16M / 10,000 = **$616**.
(A premium of $500 only matches the expected payout, E[payout] = 100 × $50,000 = $5M, so it would cover only about half of the scenarios.)
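The premium calculation maps directly onto a normal quantile. The sketch below uses the normal approximation to the number of claims, as in the solution, and also reports what share of scenarios a $500 premium would cover. Variable names are illustrative.

```python
from statistics import NormalDist

policies, p_claim, payout = 10_000, 0.01, 50_000

# Normal approximation to the number of claims: N(np, np(1-p)).
claims = NormalDist(mu=policies * p_claim,
                    sigma=(policies * p_claim * (1 - p_claim)) ** 0.5)

claims_99 = claims.inv_cdf(0.99)             # 99th percentile of the claim count
premium = claims_99 * payout / policies      # premium per policy covering 99% of scenarios

# Share of scenarios covered if the premium only matches the expected payout.
coverage_at_500 = claims.cdf(500 * policies / payout)

print(f"99th percentile of claims: {claims_99:.1f}")
print(f"minimum premium: ${premium:.0f}")
print(f"coverage with a $500 premium: {coverage_at_500:.0%}")
```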

A fair coin is tossed 400 times. By the CLT, the number of heads $X \approx N(200, 100)$. What is the probability $185 \leq X \leq 215$ (with continuity correction)?

$\mu = 200$, $\sigma = \sqrt{400\cdot 0.25} = 10$. With continuity correction $z = \pm 15.5/10 = \pm 1.55$, so $\Phi(1.55) - \Phi(-1.55) \approx 0.878$. The CLT turns the binomial into a normal at large $n$.

The CLT - The Crown of Probability Theory

This theorem unifies everything studied so far and opens the door to statistics.

Confidence Intervals: built directly on the CLT
Hypothesis Testing: z-tests and t-tests rely on (approximate) normality of the test statistic
Regression: OLS estimators are asymptotically normal by the CLT
Bayesian Statistics: the posterior distribution is often approximately normal
Machine Learning: SGD, BatchNorm, weight initialization; the CLT is everywhere

Summary

**CLT:** the standardized sum of n i.i.d. variables → standard normal distribution as n → ∞
**Universality:** holds for ANY distribution with finite variance
**Formula:** $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} \to N(0, 1)$
**Standard error:** $SE = \sigma/\sqrt{n}$, which decreases slowly: quadrupling n only halves it
**In practice:** Bin(n, p) ≈ N(np, npq) when np ≥ 5 and nq ≥ 5
**Limitations:** does not apply when variance is infinite (e.g., Cauchy)

Questions for Reflection

Return to the Galton board: how does it visually demonstrate "order from chaos"?
Why does the CLT explain why height, IQ, and measurement errors are normally distributed?
A sociologist wants to reduce a poll's margin of error from 3% to 1%. By how much will the cost of the survey increase?
What do the CLT and mini-batch SGD in neural networks have in common?
