Statistics

Hypothesis Testing: How p-values Killed 64,000 Studies

In 2015, the Open Science Collaboration reported in Science that it had tried to replicate 100 psychology studies - only 36% held up. Amgen tried to reproduce 53 landmark cancer papers - only 6 of them (about 11%) were confirmed. The p-value crisis reshaped how tech companies like Airbnb and Spotify run experiments today.

Replication crisis 2015: 64% of the 100 psychology studies tested failed independent replication
FDA drug approval: alpha=0.05 threshold for primary endpoint significance
Multiple testing at Airbnb and Spotify: Bonferroni and BH corrections
GWAS genomics: genome-wide significance threshold p < 5e-8 (not 0.05)
ML evaluation: permutation tests instead of parametric assumptions
p-hacking prevention: pre-registration and sequential testing (alpha spending)

Prerequisites

(no prerequisites)

Semmelweis, 1847: When Data Are Not Enough

**2015. 270 scientists join forces in the Open Science Collaboration and do something unprecedented.** They take 100 published psychology studies - all peer-reviewed, 97 of them reporting a statistically significant result (p < 0.05) - and attempt to replicate them. The result: **only 36% replicated**. Roughly 64 out of 100 "proven" findings did not hold up on repetition. This became known as the "replication crisis". The shock spread through medicine, economics, and neuroscience. The main culprit was not fraud or negligence - it was a widespread misunderstanding of what p < 0.05 actually means. The story begins in 1847 in Vienna.

Context	H0 (null)	H1 (alternative)
Meaning	"Nothing happened"	"There is a real effect"
Semmelweis	Mortality is the same in both wings	Handwashing reduces mortality
A/B test	Conversion for variants A and B is equal	Variant B converts better
Drug trial	New drug is no better than placebo	New drug is better than placebo
ML model	New model is no better than baseline	New model is better than baseline
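
To make the A/B test row concrete, here is a minimal sketch of a one-sided two-proportion z-test written from scratch. The function name `two_proportion_z_test` and the conversion counts are hypothetical, invented for illustration; the sketch assumes samples large enough for the normal approximation to hold.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided z-test of H0: p_a == p_b against H1: p_b > p_a."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both variants share one conversion rate, so pool the data.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # p-value: probability of a z at least this large if H0 were true.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts, invented for illustration only.
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

The p-value here is the probability of seeing a difference at least this large if H0 were true. It is not the probability that H0 is true.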

Decision	H0 is true	H1 is true
Reject H0	Type I error (alpha) - false alarm	Correct decision (power = 1-beta)
Do not reject H0	Correct decision (1-alpha)	Type II error (beta) - missed effect
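
The trade-off between alpha and beta in the table can be sketched numerically. The example below assumes a one-sided one-sample z-test with known standard deviation; the function name `z_test_power` and the effect size of 0.2 are hypothetical choices for illustration, not values from the lesson.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided one-sample z-test with known sigma.

    effect is the true shift of the mean under H1 (a hypothetical input).
    """
    std_normal = NormalDist()
    # Reject H0 when z exceeds this critical value; P(reject | H0) = alpha.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under H1 the z statistic is centred on effect * sqrt(n) / sigma.
    shift = effect * n ** 0.5 / sigma
    return 1 - std_normal.cdf(z_crit - shift)


for n in (10, 50, 200):
    power = z_test_power(effect=0.2, sigma=1.0, n=n)
    print(f"n={n:4d}  alpha=0.05  power={power:.2f}  beta={1 - power:.2f}")
```

With alpha held fixed, the only ways to shrink beta in this setup are a larger sample or a larger true effect, which is one reason small studies miss real effects so often.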

Number of tests	P(at least one false positive), alpha = 0.05, independent tests	Example use case
1	5%	Single A/B test
5	23%	Testing 5 metrics in one experiment
20	64%	Small-scale genomic screening
100	99.4%	Neuroimaging (real analyses test thousands of voxels)
1,000	~100%	GWAS (real studies test about 1M SNPs)
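
The probabilities in this table follow from 1 - (1 - alpha)^n, assuming independent tests with every null hypothesis true. The sketch below recomputes them and shows the two corrections named earlier, Bonferroni and Benjamini-Hochberg (BH). The function names and the p-values are hypothetical, chosen only to illustrate how the two procedures differ.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n independent tests, all H0 true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i only if p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_k = rank
    # Reject every hypothesis ranked at or below k.
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


for n in (1, 5, 20, 100, 1000):
    print(f"{n:5d} tests -> {familywise_error_rate(n):.1%}")

# Hypothetical p-values, invented for illustration only.
p_values = [0.001, 0.012, 0.021, 0.038, 0.30]
print("Bonferroni:", bonferroni_reject(p_values))
print("BH:        ", benjamini_hochberg_reject(p_values))
```

Bonferroni guards against even one false positive and is therefore strict. BH tolerates a controlled share of false discoveries, which usually leaves more power when many metrics are tested at once.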

The Logic of Hypotheses: Presumption of Innocence

p-value: The Most Dangerous Number in Science

Type I and Type II Errors: Two Ways to Be Wrong

The Replication Crisis Through the Lens of Errors

p-hacking and Multiple Comparisons

Practice: z-test from Scratch

Key Takeaways

What's Next

Related lessons