Probability Theory

Law of Large Numbers

Цели урока

Build intuition for the law of large numbers through examples
Understand why the sample mean stabilizes as the sample grows
Distinguish the weak and strong LLN
Understand the connection to the frequentist definition of probability
See the LLN in action: casinos, insurance, ML

Предварительные знания

Expected value E[X]
Variance Var[X]
Independent random variables
Chebyshev's inequality

**1940, a wartime internment camp in Denmark.** South African mathematician John Kerrich, stuck with nothing better to do, ran an experiment. He flipped a coin **10,000 times** and recorded every result. After 10 flips, the heads fraction was 40%. After 100, 44%. After 1,000, 50.2%. After 10,000, 50.67%. The coin was "converging" to fair. This was the Law of Large Numbers in action.

Casinos: why "the house always wins"
Insurance: predictability with millions of policies
Polls: why 1,000 people can represent millions
A/B tests: how long to wait for a significant result
ML: why mini-batch gradient descent works

The golden theorem: 20 years of work

Jacob Bernoulli spent **20 years** proving what seems intuitive today. He called the result his "golden theorem," and it was published posthumously in *Ars Conjectandi* (1713). This was the first rigorous link between **theory** (probability) and **practice** (frequency). Bernoulli's point: even a fool sees this intuitively, but proving it is another matter entirely.

Law of Large Numbers

Flip a coin once - heads or tails, huge spread. Flip it 10,000 times - the share of heads will be very close to 0.5. The fact is intuitive but requires a rigorous formulation: the **Law of Large Numbers** (LLN) is one of the founding results of probability theory.

Formally: for i.i.d. random variables $X_1, X_2, \dots$ with finite $\mathbb{E}[X]=\mu$, the sample mean $\bar{X}_n = \frac{1}{n}\sum X_i$ converges to $\mu$ as $n\to\infty$. This is the **weak LLN** (convergence in probability) or the **strong LLN** (almost-sure convergence).

The LLN is the theoretical justification of Monte Carlo methods in ML, empirical risk in supervised learning, and gradient estimation in SGD. Without it we could not justify the leap from 'sample' to 'population'.

The Law of Large Numbers states that:

LLN is the foundation of empirical estimation: frequency converges to probability, sample mean to expectation. It justifies Monte Carlo and statistical estimation in big data.

1. Intuition: what happens as the sample grows?

Suppose we flip a fair coin and track the heads fraction:

Flips	Heads	Fraction	Deviation from 0.5
10	7	0.70	0.20
100	56	0.56	0.06
1,000	512	0.512	0.012
10,000	5,067	0.5067	0.0067
100,000	50,121	0.50121	0.00121

See the pattern? The deviation **decreases** approximately as $1/\sqrt{n}$.

**Key insight:** when the sample grows by a factor of 100, the deviation decreases by about a factor of 10 (by $\sqrt{100}$).

After 10,000 flips, the heads fraction is 0.52. After another 90,000 flips (100,000 total), what deviation from 0.5 is expected?

The sample grew by a factor of 10 (from 10k to 100k). The typical deviation decreases by √10 ≈ 3.16. Was ~0.02, becomes ~0.0063. LLN in action!

2. Formal statement

Let $X_1, X_2, X_3, \ldots$ be **independent identically distributed** (i.i.d.) random variables with:

$E[X_i] = \mu$ (finite expected value)
$Var[X_i] = \sigma^2 < \infty$ (finite variance, needed for the Chebyshev proof below; the strong LLN actually goes through with just $E[|X_i|] < \infty$)

**Sample mean:**

**Law of Large Numbers:** as $n \to \infty$, the sample mean converges to the expected value:

Weak vs. Strong LLN

There are two versions of the theorem; they differ in the **type of convergence**:

Version	Authors	Conditions	Statement
Weak LLN	Bernoulli (1713), Khintchine (1929)	E[\|X\|] < ∞	P(\|X̄ₙ - μ\| > ε) → 0
Strong LLN	Kolmogorov (1930)	E[\|X\|] < ∞	P(X̄ₙ → μ) = 1

**The difference:** the weak version says the probability of a large deviation goes to 0. The strong version says the entire sample path of X̄ₙ almost surely converges to μ. In practice the difference rarely matters.

What does "convergence in probability" mean in the weak LLN?

Convergence in probability: no matter how small ε is, the probability of deviating from μ by more than ε approaches zero. But that doesn't guarantee deviations never occur.

3. Why does it work? The math behind the intuition

Compute the moments of the sample mean:

Expected value of X̄ₙ

The sample mean is an **unbiased** estimator of μ: on average, it hits μ exactly.

Variance of X̄ₙ: the key to the LLN

The variance of the sample mean **decreases as 1/n**. That is the heart of the LLN.

**Standard error of the mean:** $SE = \frac{\sigma}{\sqrt{n}}$ It decreases as $1/\sqrt{n}$. To cut the error in half, four times as many observations are required.

Proof of the weak LLN (via Chebyshev)

By Chebyshev's inequality:

The probability of a large deviation goes to zero as $n \to \infty$. ∎

X has σ = 10. What is the standard error of the mean when n = 100?

SE = σ/√n = 10/√100 = 10/10 = 1. With 100 observations, the standard error is 10 times smaller than the standard deviation of a single observation.

4. Connection to the frequentist definition of probability

The LLN **justifies** the frequentist definition of probability!

If event A has probability p, then its relative frequency in n independent trials:

This is a special case of the LLN: let $X_i = 1$ if A occurred, 0 otherwise. Then $E[X_i] = p$ and $\bar{X}_n$ is the relative frequency of A.

Kerrich's data

A real experiment from his wartime internment

Kerrich recorded results in batches:

Flips	Heads	Fraction
10	4	0.400
50	25	0.500
100	44	0.440
500	255	0.510
1,000	502	0.502
5,000	2,533	0.5066
10,000	5,067	0.5067

The frequency stabilizes around 0.5, but not monotonically.

The probability of rolling a 6 is 1/6 ≈ 0.167. After 600 rolls, 120 sixes appeared (20%). Does this contradict the LLN?

SE = √(p(1-p)/n) = √(0.167×0.833/600) ≈ 0.015. The deviation 20% - 16.7% = 3.3% is about 2.2 standard errors. Rare, but not impossible. The LLN is a limit statement (n → ∞), not a guarantee for any specific n.

5. The main error: "The Law of Averages"

After 10 heads in a row, tails "must" come up to compensate

Each flip is independent. The LLN works through dilution, not compensation

10 heads in a row is just 10 flips. After 10,000 flips, their share of the total is 10/10000 = 0.1%. New flips dilute the past; they don't correct it.

Dilution vs. compensation

Why the LLN is not about compensation

After 10 heads in a row: fraction = 10/10 = 100%. Scenario 1 (compensation): the next 10 are tails. Fraction = 10/20 = 50%. **Wrong intuition.** Scenario 2 (dilution): the next 9,990 flips are 50% heads. Fraction = (10 + 4,995) / 10,000 = 50.05%. The LLN drowns the 10 heads in the mass of new data; it does not compensate for them.

**Gambler's Fallacy:** the belief that past outcomes affect future ones in independent trials. The coin **does not remember** what came before!

At roulette, red has come up 5 times in a row. What can be said about the next spin?

Roulette spins are independent. P(red) = 18/37 every time. Past spins do not affect future ones. Believing they do is the gambler's fallacy.

Practice

A factory produces components with a 3% defect rate. An inspector checks 900 components. What is the probability that the fraction of defects deviates from 3% by more than 1%?

$SE = \sqrt{\frac{0.03 \times 0.97}{900}} = \sqrt{\frac{0.0291}{900}} \approx 0.0057$ Deviation of 1% = 0.01. In SE units: z = 0.01 / 0.0057 ≈ 1.75 P(|X̄ - 0.03| > 0.01) = P(|Z| > 1.75) ≈ 2 × (1 - 0.96) = **8%** With ~92% probability, the fraction will be between 2% and 4%.

A/B test: control (A) shows 5% conversion, test (B) shows 6%. How many users per group are needed for the difference to be statistically meaningful (SE < 0.5%)?

$SE_{diff} \approx \sqrt{\frac{2 \times 0.055 \times 0.945}{n}} = \sqrt{\frac{0.104}{n}}$ Need $SE < 0.005$: $\sqrt{\frac{0.104}{n}} < 0.005$ $\frac{0.104}{n} < 0.000025$ $n > \frac{0.104}{0.000025} = 4160$ **~4,200 users per group** minimum.

A casino has a 2% edge in every game. The average bet is $100. What is the casino's approximate profit from 1 million games? Why is the casino confident in this?

$E[\text{profit per game}] = 0.02 \times 100\,\text{USD} = 2\,\text{USD}$ $E[\text{profit from 1M games}] = 1{,}000{,}000 \times 2\,\text{USD} = 2{,}000{,}000\,\text{USD}$ By the LLN, the average profit per game converges to 2 USD. Over a million games, fluctuations are negligible compared to the total. **The casino plays the long game**, and the LLN guarantees its profit.

A casino has a 2% edge on a $\$100$ bet and runs 1,000,000 plays. Why does the LLN make a profit of about $\$2{,}000{,}000$ near-certain?

The LLN sends the per-play average to $E[X] = \$2$. The standard error of the mean shrinks as $\sigma/\sqrt{n}$, so at $n = 10^6$ deviations from $\$2{,}000{,}000$ are tiny in relative terms - the casino's business model rests on this volume.

LLN: the foundation of statistics

The law of large numbers underlies almost every statistical method.

CLT — The next step: not just convergence, but the shape of the distribution
Parameter estimation — Consistency of estimators is based on the LLN
Monte Carlo method — Estimating integrals through random sampling
ML: SGD — Mini-batch gradient converges to the full gradient by the LLN
Insurance — Predictability with a large number of policies

Итоги

**LLN:** $\bar{X}_n \to \mu$ as $n \to \infty$
**Why it works:** $Var[\bar{X}_n] = \sigma^2/n \to 0$
**Standard error:** $SE = \sigma/\sqrt{n}$ - decreases as $1/\sqrt{n}$
**Frequency converges to probability:** the justification for the frequentist interpretation
**Not compensation:** the past is diluted, not corrected

Вопросы для размышления

Back to Kerrich: why was his experiment important for science, even though the LLN had already been proven theoretically?
An insurance company pays an average of $10,000 per policy, but individual payouts range from $0 to $1 million. Why is the company confident in its calculations?
If the LLN speaks of convergence as n → ∞, how is it applied in practice with a finite n?
Why does cutting the error in half require four times as many observations, not just twice as many?

Связанные уроки

Probability Theory

Law of Large Numbers

Цели урока

Build intuition for the law of large numbers through examples
Understand why the sample mean stabilizes as the sample grows
Distinguish the weak and strong LLN
Understand the connection to the frequentist definition of probability
See the LLN in action: casinos, insurance, ML

Предварительные знания

Expected value E[X]
Variance Var[X]
Independent random variables
Chebyshev's inequality

Casinos: why "the house always wins"
Insurance: predictability with millions of policies
Polls: why 1,000 people can represent millions
A/B tests: how long to wait for a significant result
ML: why mini-batch gradient descent works

The golden theorem: 20 years of work

Law of Large Numbers

The Law of Large Numbers states that:

LLN is the foundation of empirical estimation: frequency converges to probability, sample mean to expectation. It justifies Monte Carlo and statistical estimation in big data.

1. Intuition: what happens as the sample grows?

Suppose we flip a fair coin and track the heads fraction:

Flips	Heads	Fraction	Deviation from 0.5
10	7	0.70	0.20
100	56	0.56	0.06
1,000	512	0.512	0.012
10,000	5,067	0.5067	0.0067
100,000	50,121	0.50121	0.00121

See the pattern? The deviation **decreases** approximately as $1/\sqrt{n}$.

**Key insight:** when the sample grows by a factor of 100, the deviation decreases by about a factor of 10 (by $\sqrt{100}$).

After 10,000 flips, the heads fraction is 0.52. After another 90,000 flips (100,000 total), what deviation from 0.5 is expected?

The sample grew by a factor of 10 (from 10k to 100k). The typical deviation decreases by √10 ≈ 3.16. Was ~0.02, becomes ~0.0063. LLN in action!

2. Formal statement

Let $X_1, X_2, X_3, \ldots$ be **independent identically distributed** (i.i.d.) random variables with:

$E[X_i] = \mu$ (finite expected value)
$Var[X_i] = \sigma^2 < \infty$ (finite variance, needed for the Chebyshev proof below; the strong LLN actually goes through with just $E[|X_i|] < \infty$)

**Sample mean:**

**Law of Large Numbers:** as $n \to \infty$, the sample mean converges to the expected value:

Weak vs. Strong LLN

There are two versions of the theorem; they differ in the **type of convergence**:

Version	Authors	Conditions	Statement
Weak LLN	Bernoulli (1713), Khintchine (1929)	E[\|X\|] < ∞	P(\|X̄ₙ - μ\| > ε) → 0
Strong LLN	Kolmogorov (1930)	E[\|X\|] < ∞	P(X̄ₙ → μ) = 1

What does "convergence in probability" mean in the weak LLN?

Convergence in probability: no matter how small ε is, the probability of deviating from μ by more than ε approaches zero. But that doesn't guarantee deviations never occur.

3. Why does it work? The math behind the intuition

Compute the moments of the sample mean:

Expected value of X̄ₙ

The sample mean is an **unbiased** estimator of μ: on average, it hits μ exactly.

Variance of X̄ₙ: the key to the LLN

The variance of the sample mean **decreases as 1/n**. That is the heart of the LLN.

**Standard error of the mean:** $SE = \frac{\sigma}{\sqrt{n}}$ It decreases as $1/\sqrt{n}$. To cut the error in half, four times as many observations are required.

Proof of the weak LLN (via Chebyshev)

By Chebyshev's inequality:

The probability of a large deviation goes to zero as $n \to \infty$. ∎

X has σ = 10. What is the standard error of the mean when n = 100?

SE = σ/√n = 10/√100 = 10/10 = 1. With 100 observations, the standard error is 10 times smaller than the standard deviation of a single observation.

4. Connection to the frequentist definition of probability

The LLN **justifies** the frequentist definition of probability!

If event A has probability p, then its relative frequency in n independent trials:

This is a special case of the LLN: let $X_i = 1$ if A occurred, 0 otherwise. Then $E[X_i] = p$ and $\bar{X}_n$ is the relative frequency of A.

Kerrich's data

A real experiment from his wartime internment

Kerrich recorded results in batches:

Flips	Heads	Fraction
10	4	0.400
50	25	0.500
100	44	0.440
500	255	0.510
1,000	502	0.502
5,000	2,533	0.5066
10,000	5,067	0.5067

The frequency stabilizes around 0.5, but not monotonically.

The probability of rolling a 6 is 1/6 ≈ 0.167. After 600 rolls, 120 sixes appeared (20%). Does this contradict the LLN?

5. The main error: "The Law of Averages"

After 10 heads in a row, tails "must" come up to compensate

Each flip is independent. The LLN works through dilution, not compensation

10 heads in a row is just 10 flips. After 10,000 flips, their share of the total is 10/10000 = 0.1%. New flips dilute the past; they don't correct it.

Dilution vs. compensation

Why the LLN is not about compensation

**Gambler's Fallacy:** the belief that past outcomes affect future ones in independent trials. The coin **does not remember** what came before!

At roulette, red has come up 5 times in a row. What can be said about the next spin?

Roulette spins are independent. P(red) = 18/37 every time. Past spins do not affect future ones. Believing they do is the gambler's fallacy.

Practice

A factory produces components with a 3% defect rate. An inspector checks 900 components. What is the probability that the fraction of defects deviates from 3% by more than 1%?

A/B test: control (A) shows 5% conversion, test (B) shows 6%. How many users per group are needed for the difference to be statistically meaningful (SE < 0.5%)?

A casino has a 2% edge in every game. The average bet is $100. What is the casino's approximate profit from 1 million games? Why is the casino confident in this?

A casino has a 2% edge on a $\$100$ bet and runs 1,000,000 plays. Why does the LLN make a profit of about $\$2{,}000{,}000$ near-certain?

LLN: the foundation of statistics

The law of large numbers underlies almost every statistical method.

CLT — The next step: not just convergence, but the shape of the distribution
Parameter estimation — Consistency of estimators is based on the LLN
Monte Carlo method — Estimating integrals through random sampling
ML: SGD — Mini-batch gradient converges to the full gradient by the LLN
Insurance — Predictability with a large number of policies

Итоги

**LLN:** $\bar{X}_n \to \mu$ as $n \to \infty$
**Why it works:** $Var[\bar{X}_n] = \sigma^2/n \to 0$
**Standard error:** $SE = \sigma/\sqrt{n}$ - decreases as $1/\sqrt{n}$
**Frequency converges to probability:** the justification for the frequentist interpretation
**Not compensation:** the past is diluted, not corrected

Вопросы для размышления

Back to Kerrich: why was his experiment important for science, even though the LLN had already been proven theoretically?
An insurance company pays an average of $10,000 per policy, but individual payouts range from $0 to $1 million. Why is the company confident in its calculations?
If the LLN speaks of convergence as n → ∞, how is it applied in practice with a finite n?
Why does cutting the error in half require four times as many observations, not just twice as many?