Statistics

Confidence Intervals: How Journalists Misread the 2016 Election

"52% plus or minus 3%" - virtually every poll reports this, and most readers misread it. The 95% attached to a confidence interval is not the probability that the true value lies inside it. The quarter-century dispute between Neyman and Fisher over this point still shapes how data teams run A/B tests today.

Election polls: 95% CI construction from pre-election surveys
A/B tests at Stripe and Airbnb: CI for conversion rate lifts
FDA clinical trials: confidence intervals required for primary endpoints
Replication crisis: 95% CI crossing zero means non-significant result
ML production metrics: CI for precision/recall at deployment time
Financial risk: Value-at-Risk as a one-sided confidence bound

Prerequisites

(no prerequisites)

Why a Point Estimate Is Not Enough

**November 2016. On the night of the American election, every major outlet shows the same picture.** Hillary Clinton: 52% support, 95% CI [49%, 55%]. Donald Trump: 44%, CI [41%, 47%]. Live commentary: 'With 95% confidence the true support for Clinton is between 49 and 55% - she will win.' Clinton did take the popular vote, 48.2% to 46.1%, but Trump won the presidency with 306 electoral votes. The polls did not fail catastrophically - Trump's final share fell inside his interval, and Clinton's landed just below hers. **The interpretation failed.** '95% CI' does not mean '95% probability that the true value is inside the interval'. It is a fundamentally different statement - and the confusion between the two cost journalists their professional credibility and betting markets millions of dollars.

**What this lesson actually teaches**: not "how to compute X-bar +/- 1.96*SE", but why **the correct interpretation of a confidence interval is counter-intuitive**, and what exactly the '95%' guarantees. After this lesson it will be clear why A/B tests are not concluded 'when the CI stops crossing zero', and why a frequentist CI and a Bayesian credible interval say fundamentally different things.

Why a Point Estimate Is Not Enough

MLE gives one number - the best point approximation of the parameter. Without knowing how accurate that number is, it is nearly useless. Saying 'mean server latency is 142 ms' without any indication of spread is like saying 'the average temperature of the patients in our hospital is normal': some are healthy, some have died. **A confidence interval is an honest way to report the uncertainty of an estimate.**

Context	Point estimate	With confidence interval	What changes
Drug effect	Reduces blood pressure by 8 mmHg	8 mmHg [2, 14], 95% CI	Could be minimal or substantial - FDA requires CI
A/B conversion test	Variant B is better by 0.8%	0.8% [-0.2%, 1.8%], 95% CI	Lower bound is negative - not yet proven
API latency	p99 = 230 ms	p99 = 230 ms [215, 248], 95% CI	SLA 250 ms: will it hold or not?
Ad CTR	Campaign A: CTR 3.2%	3.2% [2.9%, 3.5%], 95% CI	Realistic range for budget planning

Why is the point estimate X̄ = 142 ms alone not enough for an engineering decision?

What 95% CI Actually Means

Here is the most important fact in this lesson. After constructing a specific interval, say [48.9%, 55.1%], **it is not valid to speak of a 95% probability** that the true parameter lies inside it. The parameter either is in the interval or it is not - this is not a probabilistic event in the frequentist framework. The correct interpretation sounds different.

A 95% confidence interval for theta is a random interval [L(X), U(X)], constructed from a random sample X = (X1,...,Xn), such that:

P(L(X) <= theta <= U(X)) = 0.95

Key point: **the bounds L and U are random, not theta**. The parameter theta is fixed (though unknown). The interval changes from sample to sample.

Correct statement: 'The construction method is such that, over many repetitions, 95% of the built intervals contain the true theta'

Incorrect statement: 'There is a 95% probability that theta is in [48.9, 55.1]'

Once the sample is drawn and the interval is built, [48.9%, 55.1%] either contains theta (prob=1) or does not (prob=0). Which of the two is unknown, but it is no longer a question of probability. The probabilistic statement was made BEFORE the specific interval was built.

Analogy: Fishing with a Net

Intuition without formulas

A fisherman casts a net at a random spot in the river. 95% of the river's area is covered by fish. Question: what is the probability the net catches fish? Answer before casting: 95%. After the cast, the net is at a specific location. Question: with what probability is fish in the net? Answer: either it is or it is not - this is no longer a probability. Confidence interval = net of fixed size. True theta = fish at a specific spot. 95% = probability of catching fish BEFORE the cast. After a specific CI is built - the fish is either caught or not.

**Historical context**: Jerzy Neyman introduced confidence intervals in 1937 with exactly this 'procedural' interpretation. Ronald Fisher had proposed 'fiducial intervals', with a different interpretation, a few years earlier, in 1930. They argued publicly for 25 years. Modern statistics adopted Neyman's interpretation. For the probabilistic statement 'P(theta in interval) = 95%' one needs a Bayesian credible interval, which requires a prior.

After building a specific 95% CI [48.9%, 55.1%] for the true share θ, what is the CORRECT reading of "95%"?

How to Build One: Pivots and Their Consequences

The standard method for constructing a CI uses a **pivot**: a function of the data and the parameter whose distribution does not depend on the parameter. For the normal distribution this is the standardized sample mean.

CASE 1: sigma known (z-interval)

(X-bar - mu) / (sigma/sqrt(n)) ~ N(0, 1)
P(-1.96 <= (X-bar - mu)/(sigma/sqrt(n)) <= 1.96) = 0.95
(solve the inequality for mu)
P(X-bar - 1.96*sigma/sqrt(n) <= mu <= X-bar + 1.96*sigma/sqrt(n)) = 0.95

95% CI: X-bar +/- 1.96 * sigma/sqrt(n)

QUANTILES: 90%: z = 1.645; 95%: z = 1.960; 99%: z = 2.576

CASE 2: sigma unknown (t-interval)

(X-bar - mu) / (S/sqrt(n)) ~ t(n-1) [Student's t-distribution]

95% CI: X-bar +/- t_{alpha/2, n-1} * S/sqrt(n)

For n >= 30 the difference between t and z is under 5%; at n=5: t_{0.025,4} = 2.776 vs z = 1.960
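The two cases can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names, the five-point latency sample and the value sigma = 5 are made up for the example, and the t critical value 2.776 is the t_{0.025,4} quoted above rather than computed.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_interval(sample, sigma, level=0.95):
    """CI for the mean when the population sigma is known (CASE 1)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.960 for level=0.95
    m = mean(sample)
    half = z * sigma / sqrt(len(sample))
    return m - half, m + half

def t_interval(sample, t_crit):
    """CI for the mean when sigma is estimated by S (CASE 2).

    t_crit is t_{alpha/2, n-1}, taken from a table or another library.
    """
    m = mean(sample)
    half = t_crit * stdev(sample) / sqrt(len(sample))  # stdev divides by n-1
    return m - half, m + half

# Hypothetical latency sample in ms, n = 5, sample mean 142.
latencies_ms = [138.0, 145.0, 141.0, 150.0, 136.0]

z_lo, z_hi = z_interval(latencies_ms, sigma=5.0)    # sigma assumed known
t_lo, t_hi = t_interval(latencies_ms, t_crit=2.776)  # t_{0.025,4}
print(f"z-interval: [{z_lo:.1f}, {z_hi:.1f}]")
print(f"t-interval: [{t_lo:.1f}, {t_hi:.1f}]")
```

With n = 5 the t-interval comes out noticeably wider than the z-interval, which is the price of estimating sigma from so few points.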

CASE 3: proportion (large n)

Observe k successes out of n: p-hat = k/n. By the CLT, for large n, approximately: p-hat ~ N(p, p*(1-p)/n)

95% CI: p-hat +/- 1.96 * sqrt(p-hat*(1-p-hat)/n)

Example: 540 out of 1000 for candidate A
p-hat = 0.54
SE = sqrt(0.54*0.46/1000) = sqrt(0.000248) ~ 0.01575
95% CI: 0.54 +/- 1.96*0.01575 = 0.54 +/- 0.031 -> [50.9%, 57.1%]

The '+/- 3%' margin of error in polls is exactly this calculation. For any n the margin is largest at p=0.5 (worst case). For n=1000: max SE = 0.5/sqrt(1000) ~ 1.58%, max margin = 1.96*1.58% ~ 3.1%.
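The poll example can be reproduced with a short sketch. The function name `wald_interval` is chosen here for illustration (this normal-approximation interval is commonly called the Wald interval), and the code assumes n is large enough for the CLT approximation to be reasonable.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, level=0.95):
    """Normal-approximation CI for a proportion: p-hat +/- z * SE."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

lo, hi = wald_interval(540, 1000)
print(f"[{lo:.3f}, {hi:.3f}]")  # [0.509, 0.571]
```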

**The square-root law**: CI width is proportional to 1/sqrt(n). To halve the interval, the sample must be QUADRUPLED. This has a direct practical consequence for A/B tests: if baseline conversion is 3% and a lift of 0.1 percentage points needs to be detected, a standard power calculation calls for several hundred thousand users per variant. Power calculators compute exactly this.
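A rough sketch of what such a power calculator does is below. The formula is the common normal-approximation sample size for comparing two proportions; it is not taken from this lesson. The 80% power, the two-sided alpha of 0.05 and the name `users_per_variant` are assumptions made for the example.

```python
from math import ceil
from statistics import NormalDist

def users_per_variant(p_base, lift_abs, alpha=0.05, power=0.80):
    """Approximate users per variant needed to detect an absolute lift.

    Normal-approximation formula for a two-sided, two-proportion test:
    n = (z_{alpha/2} + z_{beta})^2 * (p1*q1 + p2*q2) / lift^2
    """
    p_new = p_base + lift_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / lift_abs ** 2)

n_small = users_per_variant(0.03, 0.001)  # detect 3.0% -> 3.1%
n_large = users_per_variant(0.03, 0.002)  # detect 3.0% -> 3.2%
print(n_small, n_large, round(n_small / n_large, 2))
```

Halving the detectable lift multiplies the required sample by roughly four, which is the square-root law seen from the planning side.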

A poll shows 540 "yes" out of 1000. What is the 95% CI for the true support?

Width, Confidence Level, and Sample Size

Three parameters are linked: confidence level (1-alpha), interval width (2E), and sample size (n). Fixing any two determines the third.

n	95% CI for proportion (p=0.5)	99% CI (p=0.5)	Comment
100	+/-9.8%	+/-12.9%	A poll of 100 people - very wide
400	+/-4.9%	+/-6.4%	Twice as precise - needs 4x more data
1 000	+/-3.1%	+/-4.1%	Standard for a national poll
2 500	+/-2.0%	+/-2.6%	Accuracy of Gallup polls
10 000	+/-1.0%	+/-1.3%	Large A/B tests on conversion

Goal: a margin of error no larger than E at confidence level (1-alpha).

For the mean (sigma known): n >= (z_{alpha/2} * sigma / E)^2
For a proportion (worst case p=0.5): n >= (z_{alpha/2} * 0.5 / E)^2

Example: 95% CI with margin <= 2%: n >= (1.96 * 0.5 / 0.02)^2 = (49)^2 = 2401

Note: for A/B tests a different formula is used (it accounts for test power - the probability of detecting an effect if it exists). The simple n formula is for interval width, not for power.
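The link between margin, confidence level and sample size for a proportion can be checked numerically. The helper names below are illustrative. The code uses the exact normal quantile rather than the rounded 1.96, so results can differ from the hand calculation in the last digit.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_for_proportion(margin, level=0.95, p=0.5):
    """Smallest n with z * sqrt(p*(1-p)/n) <= margin (worst case p=0.5)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return ceil((z / margin) ** 2 * p * (1 - p))

def margin_for_proportion(n, level=0.95, p=0.5):
    """Margin of error E for a given n and confidence level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sqrt(p * (1 - p) / n)

print(sample_size_for_proportion(0.02))  # 2401
for n in (100, 400, 1000, 2500, 10000):
    e95 = margin_for_proportion(n, 0.95)
    e99 = margin_for_proportion(n, 0.99)
    print(f"{n:>6}  +/-{e95:.1%}  +/-{e99:.1%}")
```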

To halve the width of a 95% confidence interval at the same confidence level, by what factor must the sample size grow?

Confidence Intervals in Tools and Production Systems

**Bayesian alternative - credible interval**: P(theta in [a,b] | data) = 0.95. This is exactly the probabilistic interpretation one wants. But it requires a prior P(theta) and more complex computation (MCMC). Most production systems use frequentist CI for simplicity. Bayesian credible intervals are used where prior information is valuable: clinical trials, risk management, parameter estimation on small samples.
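For a proportion there is a simple case where no MCMC is needed. With a Beta prior, the posterior after k successes in n trials is again a Beta distribution. The sketch below assumes a uniform Beta(1, 1) prior, which is a choice made for illustration and not one prescribed by the lesson. It approximates the equal-tailed 95% credible interval by sampling from the posterior with the standard library.

```python
import random

def beta_credible_interval(successes, n, level=0.95,
                           prior_a=1.0, prior_b=1.0,
                           draws=100_000, seed=0):
    """Equal-tailed credible interval for a proportion (Beta-Binomial model).

    The posterior is Beta(prior_a + successes, prior_b + failures);
    its quantiles are approximated from sorted Monte Carlo draws.
    """
    rng = random.Random(seed)
    a = prior_a + successes
    b = prior_b + (n - successes)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper

cred_lo, cred_hi = beta_credible_interval(540, 1000)
print(f"95% credible interval: [{cred_lo:.3f}, {cred_hi:.3f}]")
```

With 540 successes out of 1000 and a flat prior, the numbers land very close to the frequentist interval for the same poll. Only the credible interval supports the statement 'theta is in this range with 95% probability', and only conditional on the chosen prior.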

What is the key difference between a frequentist CI and a Bayesian credible interval?

Practice: Empirical Coverage Verification

You run a Monte Carlo simulation: generate 10 000 samples with known μ, build a 95% CI for each, and count how many cover μ. What should you observe?
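One way to run this experiment is sketched below. It uses the z-interval with known sigma; the values mu = 142, sigma = 20, n = 50 and the fixed seed are arbitrary choices for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

def empirical_coverage(mu=142.0, sigma=20.0, n=50,
                       trials=10_000, level=0.95, seed=1):
    """Share of z-intervals (sigma known) that contain the true mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sigma / sqrt(n)  # same half-width for every sample
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - half <= mu <= x_bar + half:
            hits += 1
    return hits / trials

cov = empirical_coverage()
print(f"empirical coverage: {cov:.3f}")
```

The printed share should sit close to 0.95 without being exactly 0.95. Each individual interval either covers mu or misses it, and the 95% describes the long-run hit rate of the procedure.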

Key Takeaways

**Correct interpretation**: '95% of intervals built from different samples will cover theta' - not '95% probability that theta is in this specific interval'
**For the mean**: X-bar +/- z*sigma/sqrt(n) (z-interval, sigma known) or X-bar +/- t*S/sqrt(n) (t-interval, sigma unknown, df=n-1)
**For a proportion**: p-hat +/- 1.96*sqrt(p-hat*(1-p-hat)/n). The '+/- 3%' in polls = exactly this formula at n=1000
**Width is proportional to 1/sqrt(n)**: twice as precise = four times as many data points. The fundamental economic law of A/B testing
**Peeking problem**: checking the test early violates the 95% guarantee. Sequential testing is needed
**Frequentist vs Bayesian**: CI is procedural, credible interval is probabilistic. The numbers look similar; the meaning differs

What's Next

A CI shows a range. Hypothesis testing answers a binary question: does the parameter differ from zero?

Hypothesis Testing — The complement to CI: p-value, power, Type I and II errors - the formal apparatus of A/B tests
Bootstrap — Build a CI for any statistic without analytic formulas through resampling
E-values and Anytime-Valid CI — A solution to the peeking problem: a CI valid at any stopping time, without multiple-testing corrections
Bayesian Inference — Credible interval P(theta in [a,b] | X) = 95%: the intuitive probabilistic statement

Related lessons

aie-31-evaluation
