Hypothesis Testing: How p-values Killed 64,000 Studies

In 2015, the Open Science Collaboration reported in Science its attempt to replicate 100 psychology studies - only 36% held up. Amgen tried to reproduce 53 landmark cancer papers - only 6 of them (11%) were confirmed. The p-value crisis reshaped how tech companies like Airbnb and Spotify run experiments today.

  • Replication crisis 2015: 64% of psychology studies failed independent replication
  • FDA drug approval: alpha=0.05 threshold for primary endpoint significance
  • Multiple testing at Airbnb and Spotify: Bonferroni and BH corrections
  • GWAS genomics: genome-wide significance threshold p < 5e-8 (not 0.05)
  • ML evaluation: permutation tests instead of parametric assumptions
  • p-hacking prevention: pre-registration and sequential testing (alpha spending)

Prerequisites

  • (no prerequisites)
  • Confidence Intervals: How Journalists Misread the 2016 Election

**2015. 270 scientists join forces in the Open Science Collaboration and do something unprecedented.** They take 100 published psychology studies - all peer-reviewed, all showing p < 0.05 - and attempt to replicate them. The result: **only 36% replicated**. 64 out of 100 "proven" findings vanished on repetition. This is called the "replication crisis". The shock spread through medicine, economics, and neuroscience. The culprit was not fraud or negligence - it was a fundamental misunderstanding of what p < 0.05 actually means. The story begins in 1847 in Vienna.

**What this lesson actually teaches**: not "how to compute a p-value", but **why this concept is simultaneously so powerful and so dangerous**. The p-value is the one statistical concept that even textbook authors misinterpret (documented in 2002). After this lesson it will be clear what exactly alpha = 0.05 guarantees, what test power means, and why an A/B test with a "result" of p = 0.04 can mean precisely nothing.

Semmelweis, 1847: When Data Are Not Enough

Ignaz Semmelweis worked in the maternity ward of a Vienna hospital. Maternal mortality from childbed fever: 10% in one wing, 1.5% in the other. The only difference: the first wing was staffed by doctors who had been performing autopsies, the second by midwives. Semmelweis hypothesized that doctors were transferring "cadaverous particles." He introduced mandatory handwashing with chlorinated water. Mortality dropped to 1%. The statistics were irrefutable. The medical establishment rejected him. In 1865 Semmelweis was committed to a psychiatric institution and died there of sepsis - likely from the very infection he had described.

Semmelweis's problem was not a lack of data - there was plenty. The problem was the absence of **a formal language for making decisions from data**. That language emerged between 1920 and 1933 in the work of Fisher, Neyman, and Pearson. It is called statistical hypothesis testing.

In 1847 Semmelweis had convincing data: mortality dropped from 10% to 1% after handwashing. Why did the medical community reject him?

The Logic of Hypotheses: Presumption of Innocence

Hypothesis testing works like a courtroom: **the null hypothesis H0 is innocent until proven otherwise**. The alternative hypothesis H1 is what one is trying to establish. Data are collected, and the degree to which they contradict H0 is measured. If the contradiction is strong enough - H0 is rejected.

| | H0 (null) | H1 (alternative) |
|---|---|---|
| Meaning | "Nothing happened" | "There is a real effect" |
| Semmelweis | Mortality is the same in both wings | Handwashing reduces mortality |
| A/B test | Conversion for variants A and B is equal | Variant B converts better |
| Drug trial | New drug is no better than placebo | Drug is effective |
| ML model | New model is no better than baseline | New model is significantly better |

**Practical rule**: H0 always contains an equality sign (no effect, no difference, no association). H1 contains an inequality. H0 cannot be "proven" - it can only be "not rejected" when data are insufficient. Like an acquittal: not "innocent", but "guilt not proven".

Which formulation of the null hypothesis H₀ is correct for an A/B test of a new button?

p-value: The Most Dangerous Number in Science

p-value is the probability of obtaining data as extreme as (or more extreme than) observed, **given that H0 is true**. A small p-value means: "if the null effect were real, data like these would almost never appear." This is grounds to doubt H0.

Notation: T - test statistic (computed from data), t_obs - its observed value.

p = P(|T| >= |t_obs| | H0 is true)

  • Two-tailed test (H1: mu != 0): p = P(|T| >= |t_obs|)
  • One-tailed test (H1: mu > 0): p = P(T >= t_obs)

Rejection rule (alpha - significance level, usually 0.05 or 0.01):

  • if p < alpha -> reject H0
  • if p >= alpha -> do not reject H0

Analogy: the p-value is like the probability of flipping heads 10 times in a row. If the coin is fair (H0) and 10 heads are observed, p = (1/2)^10 ~ 0.001. This is extremely unlikely under H0 -> reject H0.
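
The definitions above can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and `z_p_value` assumes the test statistic follows a standard normal distribution under H0, which the notation above does not require.

```python
from math import comb
from statistics import NormalDist

def binomial_tail_p(n: int, k: int, p_heads: float = 0.5) -> float:
    """One-tailed p-value: P(at least k heads in n flips | P(heads) = p_heads)."""
    return sum(
        comb(n, i) * p_heads**i * (1 - p_heads) ** (n - i)
        for i in range(k, n + 1)
    )

def z_p_value(t_obs: float, two_tailed: bool = True) -> float:
    """p-value for a statistic assumed to be standard normal under H0."""
    std_normal = NormalDist()
    if two_tailed:
        return 2 * (1 - std_normal.cdf(abs(t_obs)))  # P(|T| >= |t_obs|)
    return 1 - std_normal.cdf(t_obs)                 # P(T >= t_obs)

# Coin analogy: 10 heads out of 10 flips of a fair coin.
print(f"10 heads in a row: p = {binomial_tail_p(10, 10):.4f}")

# The same observed statistic gives different p-values for the two alternatives.
print(f"two-tailed, t_obs = 1.96: p = {z_p_value(1.96):.3f}")
print(f"one-tailed, t_obs = 1.96: p = {z_p_value(1.96, two_tailed=False):.3f}")
```

Note that the one-tailed p-value is half the two-tailed one for the same statistic, which is why the direction of H1 has to be fixed before looking at the data.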

**Three major p-value misconceptions** (documented by Gigerenzer, 2002):

  1. p = probability that H0 is true <- WRONG. p = P(data | H0), not P(H0 | data).
  2. 1-p = probability that H1 is true <- WRONG. This is a Bayesian claim and requires a prior.
  3. p < 0.05 = "important result" <- WRONG. The p-value does not measure effect size.

Correct: p < 0.05 means "the data are incompatible with H0 at the 5% level".

What does a p-value of 0.03 actually mean?

Type I and Type II Errors: Two Ways to Be Wrong

When making a decision from data, two errors are possible. For a fixed sample size, reducing one increases the other - this is a fundamental trade-off.

| | H0 is true | H1 is true |
|---|---|---|
| Reject H0 | Type I error (alpha) - false alarm | Correct decision (power = 1-beta) |
| Do not reject H0 | Correct decision (1-alpha) | Type II error (beta) - missed effect |

  • **Type I error (alpha, false positive)**: reject H0 when it is true. "Found" an effect that does not exist. Controlled by the choice of alpha = 0.05 (or 0.01): this is exactly what the significance level guarantees.
  • **Type II error (beta, false negative)**: do not reject H0 when H1 is true. Missed a real effect. Depends on: effect size, n, sigma, alpha.
  • **Test power (power = 1 - beta)**: probability of detecting a real effect when one exists. Standard: power >= 0.80 (80%). Larger n and larger effects -> higher power.
  • **Trade-off**: at a fixed n, decreasing alpha (stricter on Type I) increases beta (worse at catching effects).
  • **Solution**: increase n. At a fixed alpha this lowers beta, so alpha can also be tightened without giving up power.
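
To make the dependence of power on n, effect size, and alpha concrete, here is a minimal sketch for one specific case: a two-tailed one-sample z-test with known sigma. The function name and the standardized `effect_size` parameter, (mu - mu_0) / sigma, are choices made for this illustration only.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-tailed one-sample z-test (known sigma).

    effect_size is the true shift in units of sigma: (mu - mu_0) / sigma.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold for |z|
    shift = effect_size * sqrt(n)               # mean of z when H1 is true
    upper = 1 - std_normal.cdf(z_crit - shift)  # P(z >= z_crit | H1)
    lower = std_normal.cdf(-z_crit - shift)     # P(z <= -z_crit | H1)
    return upper + lower

# Larger n raises power for the same small effect.
for n in (20, 50, 200, 800):
    print(f"n = {n:4d}: power = {z_test_power(0.2, n):.2f}")

# A stricter alpha lowers power at the same n.
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: power = {z_test_power(0.2, 200, alpha):.2f}")
```

With `effect_size = 0` the function returns alpha itself: when H0 is true, the "power" is just the Type I error rate.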

The Replication Crisis Through the Lens of Errors

Why 64% of studies did not replicate

Typical psychology study of the 2010s:

  • n = 30-50 participants (small sample)
  • alpha = 0.05 (standard)
  • Real effect: small

Under these conditions test power is roughly 20-40%. Meaning: even if the effect is real, 60-80% of experiments will NOT find it. But only p < 0.05 results get published (publication bias). This is survivorship bias in pure form: the "successful" experiments are visible, the failed attempts are not. It resembles flipping a coin until the first heads and declaring "the coin always lands heads".
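
The mechanism described above can be reproduced in a small simulation. This is a sketch under simplifying assumptions that are not in the lesson: every study tests the same real effect of 0.2 sigma with n = 40 using a z-test with known sigma = 1, only results with p < alpha are "published", and each published study is rerun once with the same n. All names and numbers in the code are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def simulate_literature(true_effect=0.2, n=40, alpha=0.05,
                        studies=20_000, seed=0):
    """Simulate many small studies of one real effect; publish only p < alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = 1 / sqrt(n)      # standard error of the sample mean (sigma = 1)
    published = []        # effect estimates that reached significance
    replicated = 0        # published studies whose rerun is significant too
    for _ in range(studies):
        estimate = rng.gauss(true_effect, se)  # sample mean of one study
        if abs(estimate / se) < z_crit:
            continue                           # not significant: never published
        published.append(estimate)
        rerun = rng.gauss(true_effect, se)     # independent replication, same n
        same_sign = (rerun > 0) == (estimate > 0)
        if same_sign and abs(rerun / se) >= z_crit:
            replicated += 1
    return {
        "published_share": len(published) / studies,  # approximates the power
        "mean_published_effect": mean(published),     # inflated by selection
        "replication_rate": replicated / len(published),
    }

for name, value in simulate_literature().items():
    print(f"{name}: {value:.2f}")
```

In this setup the effect is real in every study, yet most reruns fail, and the published estimates overstate the true effect because only the luckiest samples cross the threshold.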

What is statistical power (1 - β), and why is it critical for interpreting a non-significant result?

p-hacking and Multiple Comparisons

With a single test at alpha = 0.05, the false-positive rate is 5%. But running 20 independent tests under H0, the expected number of "significant" results is 1. With 100 tests - 5 "discoveries" from nothing. This is the **multiple comparisons problem** - critical in genomics, neuroimaging, and product analytics.

| Number of tests | P(at least one false positive) | Example use case |
|---|---|---|
| 1 | 5% | Single A/B test |
| 5 | 23% | Testing 5 metrics in one experiment |
| 20 | 64% | Small-scale genomic screening |
| 100 | 99.4% | Neuroimaging (thousands of voxels) |
| 1,000 | ~100% | GWAS: 1M SNPs in genetics |
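
The probabilities in the table follow from the formula 1 - (1 - alpha)^k for k independent tests that are all run under H0. A short sketch (the function name is chosen for this example):

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across k independent tests under H0."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100, 1000):
    expected = k * 0.05  # expected number of false positives
    print(f"k = {k:5d}: expected false positives = {expected:5.2f}, "
          f"P(at least one) = {familywise_error_rate(k):.4f}")
```

The formula relies on independence; with correlated tests (for example, related product metrics) the exact probability differs, but it still grows with k.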

**Corrections**: Bonferroni correction (alpha/k) - conservative, reduces power. Benjamini-Hochberg FDR control - controls the proportion of false discoveries rather than the probability of at least one. Production A/B platforms (Netflix, Airbnb) use more modern methods: sequential testing, e-values - covered in lessons 52-54 of the course.
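
Both corrections fit in a few lines. The sketch below uses the standard textbook forms (Bonferroni: reject if p <= alpha/k; Benjamini-Hochberg: find the largest rank r with p_(r) <= r/k * alpha and reject the r smallest p-values). It is not the implementation used by any of the platforms mentioned, and the p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / k (controls the family-wise error rate)."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * alpha:
            max_rank = rank              # keep the largest rank that passes
    rejected = [False] * k
    for idx in order[:max_rank]:         # reject everything up to that rank
        rejected[idx] = True
    return rejected

# Hypothetical p-values from 10 metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:", sum(p < 0.05 for p in p_values))       # 5
print("Bonferroni: ", sum(bonferroni(p_values)))             # 1
print("BH (FDR):   ", sum(benjamini_hochberg(p_values)))     # 2
```

On this example Bonferroni keeps one discovery and Benjamini-Hochberg keeps two, which shows the power cost of the more conservative correction.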

You run 20 independent tests at α = 0.05 under H₀ for each. What is the probability of at least one false positive?

Practice: z-test from Scratch

Compute a z-test: X̄ = 105, μ₀ = 100, σ = 15, n = 36. What is the test statistic and the conclusion at α = 0.05?
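
One way to check the exercise, assuming a two-tailed alternative (the exercise does not state the direction of H1) and a known sigma. The function name is chosen for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test with known sigma."""
    z = (sample_mean - mu_0) / (sigma / sqrt(n))  # distance from mu_0 in standard errors
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # P(|Z| >= |z|) under H0
    return z, p, p < alpha

z, p, reject = one_sample_z_test(sample_mean=105, mu_0=100, sigma=15, n=36)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
# z = 2.00, p = 0.0455, reject H0: True
```

The standard error is 15 / sqrt(36) = 2.5, so the sample mean lies 2 standard errors above mu_0. The result is just past the 0.05 threshold and would not be significant at alpha = 0.01.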

Key Takeaways

  • **H0 = presumption of innocence**: no effect, no difference. H1 is what gets established; H0 is only rejected or not rejected
  • **p-value = P(data | H0)**: not the probability that H0 is true, not the importance of the effect - only the incompatibility of the data with the null hypothesis
  • **alpha = 0.05**: if H0 is true, 5% of tests will falsely show significance. This is not the error probability for any specific test
  • **Power 1-beta**: probability of finding a real effect. With small samples power is low - this is a key reason why 64% of the studies did not replicate
  • **Multiple comparisons**: k tests under H0 produce k*alpha false discoveries on average. Correction is required (Bonferroni or FDR)
  • **Peeking = error**: monitoring an A/B test and stopping at p < 0.05 breaks the alpha guarantee
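
The peeking claim can be verified with a simulation. This is a sketch with assumed settings: there is no real effect (H0 is true), observations are standard normal, and the experimenter checks a two-tailed z-test after every batch and stops at the first p < alpha. All parameter values are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate_with_peeking(looks=10, batch=50, alpha=0.05,
                                     experiments=5_000, seed=1):
    """Share of null experiments declared significant at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(experiments):
        total, n = 0.0, 0
        for _ in range(looks):
            # Sum of `batch` N(0, 1) observations, drawn in one step.
            total += rng.gauss(0, sqrt(batch))
            n += batch
            if abs(total / sqrt(n)) >= z_crit:  # z-statistic of the running mean
                false_positives += 1
                break                           # stop at the first "significant" look
    return false_positives / experiments

print(f"1 look:   {false_positive_rate_with_peeking(looks=1):.3f}")
print(f"10 looks: {false_positive_rate_with_peeking(looks=10):.3f}")
```

With a single look the rate stays near alpha; with ten looks it is several times higher, even though each individual check uses the same 0.05 threshold.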

What's Next

The general hypothesis testing framework is the foundation. Next come specific tests for specific tasks.

  • Student's t-test — The most common test: comparing means with unknown sigma and small n
  • Chi-square — Test for categorical data: goodness-of-fit and independence
  • Bootstrap — Compute p-values without distributional assumptions
  • E-values and anytime-valid tests — Solution to the peeking problem: tests valid under continuous monitoring

Related lessons

  • aie-31-evaluation
  • ml-05-evaluation