Statistics
Sampling: how 1,000 people predict the behavior of a billion
1936: George Gallup polled 50,000 people and correctly predicted Roosevelt's win, while Literary Digest with 2.4 million ballots failed catastrophically. Sample size is not the point - representativeness is.
- Nielsen TV ratings: 25,000 households decide what 130 million Americans watch - $70B in advertising per year
- FDA: ~3,000 patients in Phase-3 decide a drug's fate for millions
- A/B tests at Booking.com, Netflix, Amazon - thousands of parallel experiments per day
- ML training data: 13T tokens out of ~10²⁰ available - the sample that determines LLM quality
Population, sample, and bias
**1943.** The Allies are losing bombers on every mission. The military gathers data: where do returning aircraft show the most bullet holes? Wings and tail are riddled; engines are nearly clean. The proposed decision: reinforce where the holes cluster. Statistician Abraham Wald, working at the Statistical Research Group in New York, looks at the same data and reaches the opposite conclusion: **reinforce the areas with the fewest holes**. The explanation in one sentence: these planes came back; the ones hit in the engines did not. This is survivorship bias - one of the most insidious sampling errors there is.
**Population** - everything to draw conclusions about (all voters, all users, all molecules). **Sample** - the subset actually measured. The striking fact: a properly constructed random sample of ~1,000 units is enough to make inferences about populations millions of times larger. Accuracy depends on sample size, **not on population size**.
**1936.** Literary Digest collected about **2.4 million** mail-in ballots and confidently predicted Alf Landon's win in the US presidential election. Roosevelt won 523:8 in the Electoral College. That same year George Gallup polled about **50,000 people** using a quota sample designed to mirror the electorate, and predicted Roosevelt. Digest's mistake: its mailing lists came from telephone directories and car registrations - in the depths of the Great Depression, these skewed toward wealthier, more Republican-leaning households. Only about a quarter of the roughly 10 million ballots mailed out came back, which added non-response bias on top. Many of Roosevelt's working-class voters never made it into the sample.
Four classic bias types: **selection bias** (who got chosen), **non-response bias** (who replies), **coverage bias** (who can be reached at all), **survivorship bias** (who is visible from what happened). Almost every sampling failure is a combination of these. The professional habit before any analysis: ask "who could have been in this data but wasn't?"
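The Literary Digest effect can be reproduced in miniature. The sketch below builds a hypothetical population (the 30/70 split and the support rates are invented for illustration, not historical figures) and compares a random sample of 1,000 with a sample of 50,000 drawn only from the "reachable" wealthy group. The function name `simulate_polls` is an assumption of this example.

```python
import random


def simulate_polls(seed=0):
    """Compare a small random sample with a large but biased one."""
    rng = random.Random(seed)

    # Hypothetical population (all numbers are assumptions for illustration):
    # 30% "wealthy" voters with 40% support for candidate A,
    # 70% other voters with 70% support for candidate A.
    population = []
    for _ in range(200_000):
        wealthy = rng.random() < 0.30
        supports_a = rng.random() < (0.40 if wealthy else 0.70)
        population.append((wealthy, supports_a))
    true_share = sum(s for _, s in population) / len(population)

    # Small sample where every voter has the same chance of inclusion.
    random_sample = rng.sample(population, 1_000)
    random_estimate = sum(s for _, s in random_sample) / len(random_sample)

    # Large sample drawn only from voters reachable through the biased list.
    reachable = [voter for voter in population if voter[0]]
    biased_sample = rng.sample(reachable, 50_000)
    biased_estimate = sum(s for _, s in biased_sample) / len(biased_sample)

    return true_share, random_estimate, biased_estimate


if __name__ == "__main__":
    true_share, random_estimate, biased_estimate = simulate_polls()
    print(f"true share:           {true_share:.3f}")
    print(f"random, n=1,000:      {random_estimate:.3f}")
    print(f"biased, n=50,000:     {biased_estimate:.3f}")
```

Under these assumptions the random sample should land within a few percentage points of the true share, while the biased sample sits near the wealthy group's 40% no matter how large it gets.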
Literary Digest in 1936 polled 2.4 million people but got the election wrong. Gallup polled 50,000 and got it right. Why?
Point estimators: unbiasedness, consistency, and the CLT
The sample mean is **itself a random variable**. Not a number, but a quantity with its own distribution. Why? Repeat the experiment and the sample changes; the mean changes too. **The number called "sample mean" has a distribution, variance, and standard deviation** - simply because it depends on a random sample. This distribution is called the sampling distribution.
An estimator θ̂ is **unbiased** if E[θ̂] = θ (its expected value equals the true parameter). It is **consistent** if θ̂ converges in probability to θ as n → ∞. The sample mean X̄ is unbiased for μ. The sample variance with (n-1) in the denominator - not n - is unbiased for σ². The divisor n-1 corrects for the fact that deviations are measured from X̄ rather than from μ, which uses up one degree of freedom.
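The effect of the divisor can be checked by simulation. The sketch below draws many small samples from a standard normal distribution (true σ² = 1; the sample size, trial count, and function name are assumptions of this example) and averages both versions of the variance estimate.

```python
import random


def average_variance_estimates(n=5, trials=20_000, seed=1):
    """Average the divide-by-n and divide-by-(n-1) variance estimates."""
    rng = random.Random(seed)
    sum_biased = 0.0
    sum_unbiased = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Sum of squared deviations from the sample mean, not from mu.
        ss = sum((x - mean) ** 2 for x in sample)
        sum_biased += ss / n          # divisor n
        sum_unbiased += ss / (n - 1)  # divisor n-1
    return sum_biased / trials, sum_unbiased / trials


if __name__ == "__main__":
    biased, unbiased = average_variance_estimates()
    print(f"divide by n:     {biased:.3f}")
    print(f"divide by n-1:   {unbiased:.3f}")
```

With n = 5 the divide-by-n average should come out near (n-1)/n = 0.8 and the divide-by-(n-1) average near 1.0, which is the systematic underestimate the correction removes.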
The Central Limit Theorem (CLT): for sufficiently large n, the sample mean X̄ of independent observations is approximately normally distributed with mean μ and variance σ²/n - **regardless of the shape of the underlying distribution**, provided its variance is finite. This is the foundation of most classical confidence intervals and hypothesis tests for means. The formula **SE = σ/√n** (standard error) describes how far X̄ typically deviates from μ.
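A quick way to see the CLT at work is to average draws from a clearly non-normal distribution. The sketch below uses an exponential distribution with rate 1 (so μ = 1 and σ = 1); the choice of distribution, n = 50, and the function names are assumptions of this example.

```python
import math
import random
import statistics


def sample_means(n=50, trials=5_000, seed=2):
    """Means of `trials` samples of size n from a skewed Exp(1) population."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(trials)
    ]


def summarize(means, mu=1.0, sigma=1.0, n=50):
    """Compare the simulated sampling distribution with the CLT prediction."""
    se = sigma / math.sqrt(n)
    # Share of sample means within 1.96 standard errors of mu.
    within = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)
    avg = sum(means) / len(means)
    return avg, statistics.stdev(means), se, within


if __name__ == "__main__":
    avg, sd, se, within = summarize(sample_means())
    print(f"mean of sample means: {avg:.3f}")
    print(f"sd of sample means:   {sd:.3f} (theory: {se:.3f})")
    print(f"within 1.96 SE:       {within:.3f}")
```

If the approximation holds, the spread of the sample means should be close to σ/√n and roughly 95% of them should fall within 1.96 SE of μ, even though individual observations are strongly skewed.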
Why does sample variance divide by (n-1) rather than n?
Confidence intervals and standard error
**Standard Error (SE)** is the standard deviation of the sampling distribution - how much X̄ fluctuates around the true μ. One formula with enormous industry consequences: SE(X̄) = σ/√n. Error decreases as n grows, but not linearly - as the square root. **Doubling accuracy** requires **quadrupling the sample**.
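The square-root law is easy to verify with plain arithmetic. In the sketch below, `standard_error` and `required_n` are hypothetical helper names, and σ = 10 is an arbitrary illustrative value.

```python
import math


def standard_error(sigma, n):
    """SE of the sample mean for population standard deviation sigma."""
    return sigma / math.sqrt(n)


def required_n(sigma, target_se):
    """Smallest sample size whose SE does not exceed target_se."""
    return math.ceil((sigma / target_se) ** 2)


if __name__ == "__main__":
    print(standard_error(10, 100))  # 1.0
    print(standard_error(10, 400))  # 0.5: four times the data, half the error
    print(required_n(10, 0.5))      # 400
```

Because n enters under a square root, each further halving of the error costs four times as many observations as the previous one.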
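A 95% confidence interval: X̄ ± 1.96 × SE. **Correct interpretation**: if the experiment is repeated many times, about 95% of the intervals constructed this way will cover the true μ. **Incorrect**: "with 95% probability the true μ lies in this interval" - in the frequentist framework μ is a fixed number, so once a specific interval is computed it either contains μ or it does not. When σ is unknown and estimated by the sample standard deviation s, replace z = 1.96 with the critical value of the t-distribution with n-1 degrees of freedom; the difference matters most for small n (< 30).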
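The repeated-experiment interpretation can be simulated directly. The sketch below assumes a normal population with μ = 10 and σ = 2 (arbitrary illustrative values), estimates the SE from each sample, and counts how often the z-based interval covers μ; n = 100 is chosen large enough that z and t critical values nearly coincide.

```python
import math
import random
import statistics


def ci_coverage(mu=10.0, sigma=2.0, n=100, trials=2_000, seed=3):
    """Fraction of 95% intervals (mean +/- z * SE) that contain mu."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    covered = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Sample standard deviation with the n-1 divisor.
        s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        se = s / math.sqrt(n)
        if mean - z * se <= mu <= mean + z * se:
            covered += 1
    return covered / trials


if __name__ == "__main__":
    print(f"coverage: {ci_coverage():.3f}")
```

The coverage should come out close to 0.95. Each individual interval either contains μ or misses it; the 95% describes the procedure across repetitions.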
Population size **does not appear** in the SE formula. As long as the sample is a small fraction of the population, polling 1,000 random residents of Moscow (12M) gives the same accuracy as polling 1,000 random people on the planet (8B): for a proportion near 50%, the 95% margin of error is about ±3 percentage points in both cases. This counter-intuitive fact is what makes mass polling economically feasible. It is also why field surveys settled at ~1,000 respondents - the cost-accuracy sweet spot.
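The SE formula implicitly assumes the sample is tiny relative to the population. The standard finite population correction, √((N-n)/(N-1)), makes that assumption explicit, and the sketch below uses it to compare margins of error for a proportion. The helper name `margin_of_error` and the worst-case p = 0.5 are assumptions of this example.

```python
import math


def margin_of_error(n, population_size, p=0.5, z=1.96):
    """95% margin of error for a proportion, with finite population correction."""
    se = math.sqrt(p * (1 - p) / n)
    # The correction approaches 1 when the sample is a tiny share of the population.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return z * se * fpc


if __name__ == "__main__":
    for size in (2_000, 12_000_000, 8_000_000_000):
        print(f"N = {size:>13,}: +/- {margin_of_error(1_000, size):.4f}")
```

For 12 million and 8 billion the margins should agree to several decimal places, at roughly 0.031. Population size starts to matter only when the sample is a sizeable share of it, as in the N = 2,000 case.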
An A/B test finds a statistically significant 0.5% difference. A stakeholder says: "Let's double the sample to be twice as sure". What is wrong with this plan?
Summary
- A sample is representative when the mechanism of inclusion does not depend on the measured trait - otherwise no sample size will help
- Four biases: selection, non-response, coverage, survivorship - almost every sampling failure is a combination of these
- X̄ is an unbiased estimator of μ: E[X̄] = μ follows from linearity of expectation
- By the CLT: for large n, X̄ is approximately N(μ, σ²/n), regardless of the shape of the population distribution (finite variance assumed)
- SE = σ/√n - the square-root law; doubling accuracy requires quadrupling the sample
- Population size does not enter SE - a random sample of 1,000 is equally accurate for populations of 1M and 8B
What's next
Sampling is "what gets seen". Next - how to draw meaningful conclusions from what's seen.
- Hypothesis testing: p-value, power, type I and II errors - the formal apparatus of A/B testing
- Bootstrap: simulate the sampling distribution directly from one sample - the modern workhorse
- Bayesian inference: an alternative to the frequentist approach - posterior distributions instead of confidence intervals
Questions for reflection
- Which data the team currently uses might suffer from survivorship bias or selection bias?
- If the accuracy of a current A/B test had to be doubled - by how much would the budget increase?
- When was the last time the team asked "who could have been in this data but wasn't?" before starting analysis?