Statistics

Multiple Testing Correction

'We found the gene that causes schizophrenia!' - hundreds of such claims were published, and almost none of them replicated. The culprit: testing millions of SNPs without multiple testing correction. In 2005 Ioannidis argued that most published research findings are false. Understanding multiple testing is a matter of scientific integrity.

GWAS: threshold p < 5×10^-8 instead of 0.05 for 1 million SNPs
Neuroimaging: thousands of voxels tested for activation - strict FWE or FDR required
A/B testing: companies test 10+ metrics and risk finding a 'significant' one by chance
Reproducibility crisis in psychology: many 2000s results failed to replicate
Clinical trials: regulatory agencies (FDA, EMA) require an explicit correction procedure

Prerequisites

Hypothesis Testing: How p-values Killed 64,000 Studies

The FWER Problem and Bonferroni Correction

**The multiple testing problem:** with m independent tests at level α, all with a true H₀, the probability of at least one false discovery is 1 − (1−α)^m. At m=20 and α=0.05: P(at least one false discovery) = 1 − 0.95^20 ≈ 64%! **FWER** (Familywise Error Rate) - the probability of at least one Type I error across the family of tests. **Bonferroni correction:** test each hypothesis at α_adj = α/m. This keeps FWER ≤ α whatever the dependence between tests, but it is conservative: power drops as m grows.
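The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes independent tests with every H₀ true; the function names `fwer_independent` and `bonferroni_threshold` are illustrative and not part of any library.

```python
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    """P(at least one false rejection) for m independent tests when every H0 is true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        uncorrected = fwer_independent(m)
        # Same formula, but each test now runs at alpha / m.
        corrected = fwer_independent(m, bonferroni_threshold(m))
        print(f"m={m:>3}  uncorrected FWER={uncorrected:.3f}  with Bonferroni={corrected:.3f}")
```

With m = 1,000,000 the same `bonferroni_threshold` call gives the 5×10^-8 threshold quoted for GWAS.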

**The story of p-hacking:** in 2005 John Ioannidis published 'Why Most Published Research Findings Are False'. The main culprit: researchers test many hypotheses but only publish significant results (publication bias) without correcting for multiple testing. In genomics: millions of SNPs, threshold p < 5×10^-8 instead of 0.05.

A researcher conducts 50 independent t-tests at α=0.05. How many false positives are expected if all H₀ are true?

FDR: The Benjamini-Hochberg Procedure

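**FDR (False Discovery Rate)** - the expected fraction of false discoveries among all rejected H₀: FDR = E[V/R], where V = number of false rejections and R = total number of rejections (V/R is taken as 0 when R = 0). Controlling FDR is less conservative than controlling FWER (e.g. with Bonferroni). **Benjamini-Hochberg (BH) procedure** at level q:
1. sort the p-values: p(1) ≤ p(2) ≤ ... ≤ p(m)
2. find the largest k such that p(k) ≤ (k/m) × q
3. reject the k hypotheses with the smallest p-values, H(1), ..., H(k); if no such k exists, reject nothing.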
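The three BH steps can be sketched in plain Python as follows. The function name `benjamini_hochberg` and the example p-values are made up for illustration; the p-values were chosen so that the second-smallest one misses its own threshold but is still rejected because a larger rank passes.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans (True = reject H0), in the original order."""
    m = len(p_values)
    # Step 1: indices sorted by ascending p-value; order[0] has the smallest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step 2: largest rank k (1-based) with p_(k) <= (k / m) * q.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank

    # Step 3: reject the k hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


# Illustrative p-values (not real data).
p_values = [0.025, 0.039, 0.5, 0.004, 0.028]
print(benjamini_hochberg(p_values, q=0.05))
```

Here the sorted p-values are 0.004, 0.025, 0.028, 0.039, 0.5 and the thresholds are 0.01, 0.02, 0.03, 0.04, 0.05. Rank 2 fails (0.025 > 0.02), but rank 4 passes, so k = 4 and four hypotheses are rejected. Bonferroni at α/m = 0.01 would reject only the first.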

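**q-value**: the BH-adjusted p-value, i.e. the smallest FDR level at which that hypothesis would be rejected. For the p-value with rank i: q(i) = min over j ≥ i of p(j) × m/j, capped at 1. The raw p × m/rank is only the first step; taking the minimum over larger ranks keeps q-values monotone in rank. Widely used in genomics: 'FDR q < 0.05' means no more than 5% of declared discoveries are expected to be false. The `statsmodels.stats.multitest.multipletests` function implements the common correction methods, including Bonferroni, Holm and BH (`method='fdr_bh'`).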
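To make the adjustment concrete, here is a small dependency-free sketch of BH-adjusted p-values. The name `bh_adjusted_p_values` and the input values are illustrative; on the same input, the result should agree with the adjusted p-values returned by `multipletests(..., method='fdr_bh')` up to floating-point error.

```python
def bh_adjusted_p_values(p_values):
    """BH-adjusted p-values (q-values), returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending by p-value

    adjusted = [0.0] * m
    running_min = 1.0  # starting at 1 also caps every adjusted value at 1
    # Walk from the largest p-value down, so each q is the minimum of
    # p_(j) * m / j over all ranks j >= its own rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values (not real data).
p_values = [0.025, 0.039, 0.5, 0.004, 0.028]
q_values = bh_adjusted_p_values(p_values)
for p, q in zip(p_values, q_values):
    print(f"p={p:.3f}  q={q:.4f}")
```

Note that p = 0.025 has raw value 0.025 × 5/2 = 0.0625, but its q-value is pulled down to about 0.0467 by the next rank. Declaring everything with q ≤ 0.05 significant gives the same rejections as running the BH procedure at q = 0.05.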

GWAS: 500,000 SNPs tested. BH with q=0.05 rejected 1,000 H₀. What does this mean?

Practical Guidance: Which Method and When

**The choice of method depends on the cost of errors:** if one false discovery is catastrophic (new drug trial: one approved ineffective drug = patient harm) → FWER (Bonferroni or Holm). If ~5% 'noise' among discoveries is acceptable (genomics: we'll validate 1,000 candidates in the lab; 50 false is fine) → FDR (BH).

| Situation | Method | Why |
|---|---|---|
| Clinical trials, 2-5 endpoints | Bonferroni or Holm | One error = patient harm; strict FWER |
| Genomics, 500K+ SNPs | BH (FDR q=0.05) | ~5% noise acceptable; power matters |
| Neuroimaging (voxel-wise) | FWE (GRF) or FDR | Spatial correlation; domain-specific tools |
| A/B test, 10+ metrics | BH or Holm | Depends on importance of each metric |
| Single comparison, no family | No correction | A lone test needs no correction |
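The table names Holm next to Bonferroni without spelling it out. As a rough sketch: Holm's step-down procedure sorts the p-values, compares the i-th smallest with α/(m − i + 1), and stops at the first one that fails. The function names and the p-values below are hypothetical and chosen only to show where the two methods differ.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure; returns booleans (True = reject H0) in original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending by p-value

    reject = [False] * m
    for step, idx in enumerate(order):  # step = 0 for the smallest p-value
        # The i-th smallest p-value (i = step + 1) is compared with alpha / (m - i + 1).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # first failure: this and all larger p-values are retained
    return reject


def bonferroni(p_values, alpha=0.05):
    """Plain Bonferroni: every p-value is compared with alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values for a small family of four tests (not real data).
family_p = [0.010, 0.015, 0.020, 0.300]
print("Bonferroni:", bonferroni(family_p))
print("Holm:      ", holm(family_p))
```

With these values Bonferroni rejects only the first hypothesis (threshold 0.0125 for all), while Holm rejects the first three (thresholds 0.0125, 0.0167, 0.025), with both methods controlling FWER at 0.05.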

A pharmaceutical company simultaneously tests a new drug on 4 primary endpoints (mortality, stroke, myocardial infarction, hospitalisations) in one trial. Which correction should be used?

Key Ideas

With m independent tests and all H₀ true, FWER = 1 − (1−α)^m: at m=20 and α=0.05 there is a 64% chance of at least one false discovery!
Bonferroni: α/m - strict, conservative; for clinical and regulatory trials
Holm-Bonferroni: more powerful than Bonferroni, with the same FWER guarantee
BH (Benjamini-Hochberg): controls FDR (expected fraction of false discoveries); more powerful
q-value = BH-adjusted p-value; FDR q < 0.05 → at most 5% of significant findings are expected to be false
FWER: for clinical settings where one error is costly; FDR: for omics and discovery

Connections to Other Methods

Multiple testing extends hypothesis testing (families of tests), FDR is ubiquitous in bioinformatics (DESeq2, limma), and permutation tests provide exact FWER control (Westfall-Young).

Hypothesis Testing — Multiple testing extends the single-test framework to a family
Bootstrap and Resampling — Permutation tests with Westfall-Young provide exact FWER

Questions for Reflection

Why does pre-registration of hypotheses solve part of the multiple testing problem?
In an A/B test you have 10 metrics. One is significant (p=0.02). Should you apply Bonferroni or BH? How does the conclusion change?
What is the 'reproducibility crisis'? How do multiple testing and p-hacking contribute to false discoveries in science?

Related Lessons

prob-04-bayes
