Probability Theory
Law of Large Numbers
Цели урока
- Build intuition for the law of large numbers through examples
- Understand why the sample mean stabilizes as the sample grows
- Distinguish the weak and strong LLN
- Understand the connection to the frequentist definition of probability
- See the LLN in action: casinos, insurance, ML
Предварительные знания
- Expected value E[X]
- Variance Var[X]
- Independent random variables
- Chebyshev's inequality
**1940, a wartime internment camp in Denmark.** South African mathematician John Kerrich, stuck with nothing better to do, ran an experiment. He flipped a coin **10,000 times** and recorded every result. After 10 flips, the heads fraction was 40%. After 100, 44%. After 1,000, 50.2%. After 10,000, 50.67%. The coin was "converging" to fair. This was the Law of Large Numbers in action.
- Casinos: why "the house always wins"
- Insurance: predictability with millions of policies
- Polls: why 1,000 people can represent millions
- A/B tests: how long to wait for a significant result
- ML: why mini-batch gradient descent works
The golden theorem: 20 years of work
Jacob Bernoulli spent **20 years** proving what seems intuitive today. He called the result his "golden theorem," and it was published posthumously in *Ars Conjectandi* (1713). This was the first rigorous link between **theory** (probability) and **practice** (frequency). Bernoulli's point: even a fool sees this intuitively, but proving it is another matter entirely.
Law of Large Numbers
Flip a coin once - heads or tails, huge spread. Flip it 10,000 times - the share of heads will be very close to 0.5. The fact is intuitive but requires a rigorous formulation: the **Law of Large Numbers** (LLN) is one of the founding results of probability theory.
Formally: for i.i.d. random variables $X_1, X_2, \dots$ with finite $\mathbb{E}[X]=\mu$, the sample mean $\bar{X}_n = \frac{1}{n}\sum X_i$ converges to $\mu$ as $n\to\infty$. This is the **weak LLN** (convergence in probability) or the **strong LLN** (almost-sure convergence).
The LLN is the theoretical justification of Monte Carlo methods in ML, empirical risk in supervised learning, and gradient estimation in SGD. Without it we could not justify the leap from 'sample' to 'population'.
The Law of Large Numbers states that:
LLN is the foundation of empirical estimation: frequency converges to probability, sample mean to expectation. It justifies Monte Carlo and statistical estimation in big data.
1. Intuition: what happens as the sample grows?
1. Intuition: what happens as the sample grows?
Suppose we flip a fair coin and track the heads fraction:
| Flips | Heads | Fraction | Deviation from 0.5 |
|---|---|---|---|
| 10 | 7 | 0.70 | 0.20 |
| 100 | 56 | 0.56 | 0.06 |
| 1,000 | 512 | 0.512 | 0.012 |
| 10,000 | 5,067 | 0.5067 | 0.0067 |
| 100,000 | 50,121 | 0.50121 | 0.00121 |
See the pattern? The deviation **decreases** approximately as $1/\sqrt{n}$.
**Key insight:** when the sample grows by a factor of 100, the deviation decreases by about a factor of 10 (by $\sqrt{100}$).
After 10,000 flips, the heads fraction is 0.52. After another 90,000 flips (100,000 total), what deviation from 0.5 is expected?
The sample grew by a factor of 10 (from 10k to 100k). The typical deviation decreases by √10 ≈ 3.16. Was ~0.02, becomes ~0.0063. LLN in action!
2. Formal statement
2. Formal statement
Let $X_1, X_2, X_3, \ldots$ be **independent identically distributed** (i.i.d.) random variables with:
- $E[X_i] = \mu$ (finite expected value)
- $Var[X_i] = \sigma^2 < \infty$ (finite variance, needed for the Chebyshev proof below; the strong LLN actually goes through with just $E[|X_i|] < \infty$)
**Sample mean:**
**Law of Large Numbers:** as $n \to \infty$, the sample mean converges to the expected value:
Weak vs. Strong LLN
There are two versions of the theorem; they differ in the **type of convergence**:
| Version | Authors | Conditions | Statement |
|---|---|---|---|
| Weak LLN | Bernoulli (1713), Khintchine (1929) | E[|X|] < ∞ | P(|X̄ₙ - μ| > ε) → 0 |
| Strong LLN | Kolmogorov (1930) | E[|X|] < ∞ | P(X̄ₙ → μ) = 1 |
**The difference:** the weak version says the probability of a large deviation goes to 0. The strong version says the entire sample path of X̄ₙ almost surely converges to μ. In practice the difference rarely matters.
What does "convergence in probability" mean in the weak LLN?
Convergence in probability: no matter how small ε is, the probability of deviating from μ by more than ε approaches zero. But that doesn't guarantee deviations never occur.
3. Why does it work? The math behind the intuition
3. Why does it work? The math behind the intuition
Compute the moments of the sample mean:
Expected value of X̄ₙ
The sample mean is an **unbiased** estimator of μ: on average, it hits μ exactly.
Variance of X̄ₙ: the key to the LLN
The variance of the sample mean **decreases as 1/n**. That is the heart of the LLN.
**Standard error of the mean:** $SE = \frac{\sigma}{\sqrt{n}}$ It decreases as $1/\sqrt{n}$. To cut the error in half, four times as many observations are required.
Proof of the weak LLN (via Chebyshev)
By Chebyshev's inequality:
The probability of a large deviation goes to zero as $n \to \infty$. ∎
X has σ = 10. What is the standard error of the mean when n = 100?
SE = σ/√n = 10/√100 = 10/10 = 1. With 100 observations, the standard error is 10 times smaller than the standard deviation of a single observation.
4. Connection to the frequentist definition of probability
4. Connection to the frequentist definition of probability
The LLN **justifies** the frequentist definition of probability!
If event A has probability p, then its relative frequency in n independent trials:
This is a special case of the LLN: let $X_i = 1$ if A occurred, 0 otherwise. Then $E[X_i] = p$ and $\bar{X}_n$ is the relative frequency of A.
Kerrich's data
A real experiment from his wartime internment
Kerrich recorded results in batches:
| Flips | Heads | Fraction |
|---|---|---|
| 10 | 4 | 0.400 |
| 50 | 25 | 0.500 |
| 100 | 44 | 0.440 |
| 500 | 255 | 0.510 |
| 1,000 | 502 | 0.502 |
| 5,000 | 2,533 | 0.5066 |
| 10,000 | 5,067 | 0.5067 |
The frequency stabilizes around 0.5, but not monotonically.
The probability of rolling a 6 is 1/6 ≈ 0.167. After 600 rolls, 120 sixes appeared (20%). Does this contradict the LLN?
SE = √(p(1-p)/n) = √(0.167×0.833/600) ≈ 0.015. The deviation 20% - 16.7% = 3.3% is about 2.2 standard errors. Rare, but not impossible. The LLN is a limit statement (n → ∞), not a guarantee for any specific n.
5. The main error: "The Law of Averages"
5. The main error: "The Law of Averages"
After 10 heads in a row, tails "must" come up to compensate
Each flip is independent. The LLN works through dilution, not compensation
10 heads in a row is just 10 flips. After 10,000 flips, their share of the total is 10/10000 = 0.1%. New flips dilute the past; they don't correct it.
Dilution vs. compensation
Why the LLN is not about compensation
After 10 heads in a row: fraction = 10/10 = 100%. Scenario 1 (compensation): the next 10 are tails. Fraction = 10/20 = 50%. **Wrong intuition.** Scenario 2 (dilution): the next 9,990 flips are 50% heads. Fraction = (10 + 4,995) / 10,000 = 50.05%. The LLN drowns the 10 heads in the mass of new data; it does not compensate for them.
**Gambler's Fallacy:** the belief that past outcomes affect future ones in independent trials. The coin **does not remember** what came before!
At roulette, red has come up 5 times in a row. What can be said about the next spin?
Roulette spins are independent. P(red) = 18/37 every time. Past spins do not affect future ones. Believing they do is the gambler's fallacy.
Practice
Practice
A factory produces components with a 3% defect rate. An inspector checks 900 components. What is the probability that the fraction of defects deviates from 3% by more than 1%?
$SE = \sqrt{\frac{0.03 \times 0.97}{900}} = \sqrt{\frac{0.0291}{900}} \approx 0.0057$ Deviation of 1% = 0.01. In SE units: z = 0.01 / 0.0057 ≈ 1.75 P(|X̄ - 0.03| > 0.01) = P(|Z| > 1.75) ≈ 2 × (1 - 0.96) = **8%** With ~92% probability, the fraction will be between 2% and 4%.
A/B test: control (A) shows 5% conversion, test (B) shows 6%. How many users per group are needed for the difference to be statistically meaningful (SE < 0.5%)?
$SE_{diff} \approx \sqrt{\frac{2 \times 0.055 \times 0.945}{n}} = \sqrt{\frac{0.104}{n}}$ Need $SE < 0.005$: $\sqrt{\frac{0.104}{n}} < 0.005$ $\frac{0.104}{n} < 0.000025$ $n > \frac{0.104}{0.000025} = 4160$ **~4,200 users per group** minimum.
A casino has a 2% edge in every game. The average bet is $100. What is the casino's approximate profit from 1 million games? Why is the casino confident in this?
$E[\text{profit per game}] = 0.02 \times 100\,\text{USD} = 2\,\text{USD}$ $E[\text{profit from 1M games}] = 1{,}000{,}000 \times 2\,\text{USD} = 2{,}000{,}000\,\text{USD}$ By the LLN, the average profit per game converges to 2 USD. Over a million games, fluctuations are negligible compared to the total. **The casino plays the long game**, and the LLN guarantees its profit.
A casino has a 2% edge on a $\$100$ bet and runs 1,000,000 plays. Why does the LLN make a profit of about $\$2{,}000{,}000$ near-certain?
The LLN sends the per-play average to $E[X] = \$2$. The standard error of the mean shrinks as $\sigma/\sqrt{n}$, so at $n = 10^6$ deviations from $\$2{,}000{,}000$ are tiny in relative terms - the casino's business model rests on this volume.
LLN: the foundation of statistics
The law of large numbers underlies almost every statistical method.
- CLT — The next step: not just convergence, but the shape of the distribution
- Parameter estimation — Consistency of estimators is based on the LLN
- Monte Carlo method — Estimating integrals through random sampling
- ML: SGD — Mini-batch gradient converges to the full gradient by the LLN
- Insurance — Predictability with a large number of policies
Итоги
- **LLN:** $\bar{X}_n \to \mu$ as $n \to \infty$
- **Why it works:** $Var[\bar{X}_n] = \sigma^2/n \to 0$
- **Standard error:** $SE = \sigma/\sqrt{n}$ - decreases as $1/\sqrt{n}$
- **Frequency converges to probability:** the justification for the frequentist interpretation
- **Not compensation:** the past is diluted, not corrected
Вопросы для размышления
- Back to Kerrich: why was his experiment important for science, even though the LLN had already been proven theoretically?
- An insurance company pays an average of $10,000 per policy, but individual payouts range from $0 to $1 million. Why is the company confident in its calculations?
- If the LLN speaks of convergence as n → ∞, how is it applied in practice with a finite n?
- Why does cutting the error in half require four times as many observations, not just twice as many?