Statistics

Multiple Testing Correction

'We found the gene that causes schizophrenia!' - there were hundreds of such claims, and virtually none of them replicated. The culprit: testing millions of SNPs without multiple testing correction. In 2005 Ioannidis argued that most published research findings are false. Understanding multiple testing is a matter of scientific integrity.

  • GWAS: genome-wide threshold p < 5×10^-8 instead of 0.05, i.e. 0.05 divided by roughly 1 million tested SNPs
  • Neuroimaging: thousands of voxels tested for activation - strict FWE or FDR required
  • A/B testing: companies test 10+ metrics and risk finding a 'significant' one by chance
  • Reproducibility crisis in psychology: many 2000s results failed to replicate
  • Clinical trials: regulatory agencies (FDA, EMA) require an explicit correction procedure

Prerequisites

  • Hypothesis Testing: How p-values Killed 64,000 Studies

The FWER Problem and Bonferroni Correction

**The multiple testing problem:** with m independent tests at level α, and all null hypotheses true, the probability of at least one false discovery is 1 − (1−α)^m. At m=20 and α=0.05: P(at least one false) = 1 − 0.95^20 ≈ 64%! **FWER** (Familywise Error Rate) - the probability of at least one Type I error across the family of tests. **Bonferroni correction:** test each hypothesis at α_adj = α/m. Conservative, but it guarantees FWER ≤ α whatever the dependence between tests.
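
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming independent tests with every H₀ true; the function names `fwer_independent` and `bonferroni_alpha` are illustrative and not part of any library.

```python
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    """P(at least one Type I error) across m independent tests, all H0 true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(m: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        raw = fwer_independent(m)
        # Re-run the same formula with the corrected per-test level alpha/m.
        corrected = fwer_independent(m, bonferroni_alpha(m))
        print(f"m={m:>3}  uncorrected FWER={raw:.3f}  with Bonferroni={corrected:.3f}")
```

With the corrected per-test level the familywise rate stays just under 0.05 for every m, which is the sense in which Bonferroni is safe but conservative.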

**The story of p-hacking:** in 2005 John Ioannidis published 'Why Most Published Research Findings Are False'. The main culprit: researchers test many hypotheses but publish only the significant results (publication bias), without correcting for multiple testing. Genomics responded by tightening the threshold: with millions of SNPs tested, significance requires p < 5×10^-8 instead of 0.05.

A researcher conducts 50 independent t-tests at α=0.05. How many false positives are expected if all H₀ are true?

FDR: The Benjamini-Hochberg Procedure

**FDR (False Discovery Rate)** - the expected fraction of false discoveries among all rejected H₀: FDR = E[V/R], where V = false rejections and R = total rejections (V/R is taken as 0 when R = 0). Controlling FDR is less conservative than controlling FWER (e.g. with Bonferroni). **Benjamini-Hochberg (BH) procedure** at FDR level q:
  1. sort the p-values: p(1) ≤ p(2) ≤ ... ≤ p(m)
  2. find the largest k such that p(k) ≤ (k/m) × q
  3. reject H(1), ..., H(k), the hypotheses with the k smallest p-values (if no such k exists, reject nothing).
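
The three BH steps can be sketched in plain Python as follows. The function name `benjamini_hochberg` and the example p-values are made up for illustration; this is a sketch of the procedure as described, not a replacement for a tested library routine.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where H0 is rejected by the BH step-up rule."""
    m = len(p_values)
    # Step 1: indices of the p-values in ascending order (order[0] points at p(1)).
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: largest 1-based rank k with p(k) <= (k / m) * q; k = 0 means none.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    # Step 3: reject the hypotheses holding the k smallest p-values.
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values chosen to show the step-up behaviour.
    pvals = [0.02, 0.021, 0.022, 0.5]
    print(benjamini_hochberg(pvals, q=0.05))
```

In this example the smallest p-value (0.02) is above its own threshold of 0.0125, yet it is still rejected because a larger rank (k = 3) satisfies the condition. That is why step 2 looks for the largest k rather than stopping at the first failure.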

**q-value**: the adjusted p-value under BH. For the p-value at a given rank, q = p × m/rank, then made monotone by taking the minimum over all larger ranks and capping at 1. Widely used in genomics: 'FDR q < 0.05' means that no more than 5% of declared discoveries are expected to be false. The `statsmodels.stats.multitest.multipletests` function implements all major correction methods.
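
A minimal sketch of BH-adjusted p-values in plain Python, assuming the usual convention that the raw values p × m/rank are made monotone from the largest rank downwards and capped at 1. The name `bh_adjusted` and the input values are illustrative.

```python
def bh_adjusted(p_values):
    """BH-adjusted p-values, returned in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps every adjusted value at 1
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over the current rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.02, 0.021, 0.022, 0.5]  # hypothetical p-values
    for p, adj in zip(pvals, bh_adjusted(pvals)):
        print(f"p={p:.3f}  adjusted={adj:.4f}")
```

Rejecting every hypothesis whose adjusted value is at most q gives the same decisions as running the BH procedure at level q. In practice one would more likely call `multipletests(pvals, method="fdr_bh")` from statsmodels, which should return matching adjusted values, although that is not checked here.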

GWAS: 500,000 SNPs tested. BH with q=0.05 rejected 1,000 H₀. What does this mean?

Practical Guidance: Which Method and When

**The choice of method depends on the cost of errors:** if one false discovery is catastrophic (new drug trial: one approved ineffective drug = patient harm) → FWER (Bonferroni or Holm). If ~5% 'noise' among discoveries is acceptable (genomics: we'll validate 1,000 candidates in the lab; 50 false is fine) → FDR (BH).

| Situation | Method | Why |
|---|---|---|
| Clinical trials, 2-5 endpoints | Bonferroni or Holm | One error = patient harm; strict FWER |
| Genomics, 500K+ SNPs | BH (FDR q=0.05) | ~5% noise acceptable; power matters |
| Neuroimaging (voxel-wise) | FWE (GRF) or FDR | Spatial correlation; domain-specific tools |
| A/B test, 10+ metrics | BH or Holm | Depends on importance of each metric |
| Single comparison, no family | No correction | A lone test needs no correction |
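
The table lists Holm next to Bonferroni without spelling it out. The sketch below assumes the standard Holm step-down rule: compare the smallest p-value with α/m, the next with α/(m−1), and so on, stopping at the first failure. The function names and the four p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 wherever p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; controls FWER at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # step 0 compares against alpha/m, step 1 against alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # step-down: the first non-rejection ends the procedure
    return reject


if __name__ == "__main__":
    pvals = [0.011, 0.04, 0.013, 0.02]  # hypothetical p-values for 4 tests
    print("Bonferroni:", bonferroni(pvals))
    print("Holm:      ", holm(pvals))
```

On these made-up values Bonferroni rejects only the smallest p-value, while Holm rejects all four at the same familywise level. Holm never rejects fewer hypotheses than Bonferroni.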

A pharmaceutical company simultaneously tests a new drug on 4 primary endpoints (mortality, stroke, myocardial infarction, hospitalisations) in one trial. Which correction should be used?

Key Ideas

  • With m independent tests and all H₀ true, FWER = 1 − (1−α)^m: at m=20 and α=0.05 there is a 64% chance of at least one false discovery!
  • Bonferroni: α/m - strict, conservative; for clinical and regulatory trials
  • Holm-Bonferroni: step-down version of Bonferroni; same FWER control, never less powerful
  • BH (Benjamini-Hochberg): controls FDR (fraction of false discoveries); more powerful
  • q-value = BH-adjusted p-value; FDR q < 0.05 → ≤5% of significant findings expected to be false
  • FWER: for clinical settings where one error is costly; FDR: for omics and discovery

Connections to Other Methods

Multiple testing extends hypothesis testing (families of tests), FDR is ubiquitous in bioinformatics (DESeq2, limma), and permutation-based procedures (Westfall-Young) control FWER while accounting for correlation between tests.

  • Hypothesis Testing - Multiple testing extends the single-test framework to a family
  • Bootstrap and Resampling - Permutation tests with Westfall-Young control FWER under correlated tests

Questions for Reflection

  • Why does pre-registration of hypotheses solve part of the multiple testing problem?
  • In an A/B test you have 10 metrics. One is significant (p=0.02). Should you apply Bonferroni or BH? How does the conclusion change?
  • What is the 'reproducibility crisis'? How do multiple testing and p-hacking contribute to false discoveries in science?

Related Lessons

  • prob-04-bayes