Logic

Correlation vs Causation

'Scientists prove: chocolate helps you lose weight.' This headline went around the world in 2015. The study was real and was published in a journal. But it was a trap: journalist John Bohannon ran a deliberately flawed study to show how uncritically the media report a weak statistical finding as proven cause and effect.

**Medicine:** for decades, hormone replacement therapy was thought to protect the heart, based on observational correlations. The Women's Health Initiative RCT showed the opposite: the therapy increased cardiovascular risk.
**Business:** 'Companies with corporate universities are more profitable.' Or maybe only profitable companies can afford universities?
**Politics:** 'Countries with greater press freedom are wealthier.' But wealth may cause freedom, not the other way around.

Correlation

**Correlation** is a statistical association between two variables. When one variable goes up, the other tends to go up as well (positive correlation) or to go down (negative correlation). Correlation shows that variables **move together**, but says nothing about **why**.

**The correlation coefficient** (r) ranges from -1 to +1. r = +1 is a perfect positive linear relationship, r = -1 a perfect negative one, and r = 0 means no linear relationship (a nonlinear one may still exist). Important: even a strong correlation (r = 0.95) does not prove causation.
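As a minimal sketch of how r is computed, the snippet below implements the Pearson formula directly with the standard library. The helper name `pearson_r` and the toy data are illustrative assumptions, not part of the lesson. The third case shows that r = 0 only rules out a linear relationship: y is fully determined by x, yet r is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]               # toy data, symmetric around zero
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = pearson_r(xs, [-2 * x + 1 for x in xs])  # points on a falling line
r_curve = pearson_r(xs, [x ** 2 for x in xs])     # parabola: dependent, but not linearly

print(r_up, r_down, r_curve)
```

Here `r_up` is +1, `r_down` is -1, and `r_curve` is 0 even though knowing x tells you y exactly.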

Seeing a correlation, we instinctively look for a causal link. That is an evolutionary trait. Our ancestors saw the link between dark clouds and rain and drew the right conclusions. But the modern world is full of **random coincidences** and **hidden factors** that create the illusion of causation.

A study found: the more firefighters arrive at a fire, the larger the damage from the blaze (r = 0.82). Which conclusion is correct?

Causation

**Causation** is when one event **really causes** another. Unlike correlation, causation has **direction** and a **mechanism**. If A causes B, changing A will change B (but not the other way around).

**Bradford Hill criteria** for establishing causation: 1) Strength of association (the stronger, the more likely); 2) Consistency (replicates across conditions); 3) Specificity (a specific cause → a specific effect); 4) Temporality (cause precedes effect); 5) Dose-response gradient (more cause → more effect); 6) Plausibility (a mechanism exists); 7) Coherence (fits other knowledge); 8) Experiment (removing the cause removes the effect); 9) Analogy (similar causes are known to produce similar effects).

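The most reliable way to establish causation is the **randomized controlled trial (RCT)**. We randomly split participants into groups and give one the 'cause' (a drug) and the other a placebo. Because assignment is random, the groups differ systematically only in the treatment, so if the treated group shows the effect clearly more often than the placebo group, that is strong evidence of causation.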
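The logic of randomization can be illustrated with a small simulation. This is a sketch under stated assumptions: the drug, its effect size (`TRUE_EFFECT`), and the hidden `baseline` trait are all invented for illustration. It shows that a coin-flip assignment leaves the hidden trait nearly equal in both groups, so the difference in outcomes approximates the treatment effect.

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 2.0  # assumed effect of the hypothetical drug

# Hidden trait (say, baseline health) that also influences the outcome.
baseline = [random.gauss(0, 1) for _ in range(10_000)]

# Randomization: a coin flip decides the group, independently of baseline.
treated = [random.random() < 0.5 for _ in baseline]

# Outcome = hidden trait + treatment effect (if treated) + noise.
outcome = [
    b + (TRUE_EFFECT if t else 0.0) + random.gauss(0, 1)
    for b, t in zip(baseline, treated)
]

def group_mean(values, flags, wanted):
    """Mean of the values whose flag equals `wanted`."""
    return mean(v for v, f in zip(values, flags) if f == wanted)

# How different are the groups on the hidden trait? (Should be near 0.)
baseline_gap = group_mean(baseline, treated, True) - group_mean(baseline, treated, False)

# Difference in mean outcome between drug and placebo groups.
estimated_effect = group_mean(outcome, treated, True) - group_mean(outcome, treated, False)

print(f"baseline gap: {baseline_gap:.3f}, estimated effect: {estimated_effect:.3f}")
```

If participants chose their own group instead, `baseline_gap` would no longer be guaranteed to sit near zero, and the outcome difference would mix the drug's effect with the trait's effect.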

A study found: kids who eat breakfast do better at school. What is needed to prove causation?

Confounders

**A confounder (confounding variable)** is a hidden factor that affects both the 'cause' and the 'effect', creating the illusion of a direct link. Confounders are the main source of false causal claims in observational studies.

**How to spot a confounder:** ask 'What else could affect BOTH variables?' Classic confounders: socioeconomic status, age, education, geography, season, overall lifestyle.

**Confounder control** has several routes: 1) **Randomization**, the gold standard, balances both known and unknown confounders across groups on average; 2) **Stratification**, separate analysis per group (men/women); 3) **Multivariate regression**, statistically 'subtracts' the confounder's effect; 4) **Matching**, pair up subjects with identical values of the confounders.
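Stratification can be sketched on simulated data. In the example below, the binary confounder `z` and the variables `x` and `y` are hypothetical: `z` raises both `x` and `y`, while `x` has no effect on `y` at all. The pooled correlation looks substantial, but it largely vanishes once each level of `z` is analysed separately.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

random.seed(1)
n = 4000
z = [random.random() < 0.5 for _ in range(n)]                 # hidden binary confounder
x = [(3.0 if zi else 0.0) + random.gauss(0, 1) for zi in z]   # driven by z only
y = [(3.0 if zi else 0.0) + random.gauss(0, 1) for zi in z]   # driven by z only, not by x

# Ignoring the confounder: x and y appear linked.
r_pooled = pearson_r(x, y)

# Stratification: compute the correlation separately within each level of z.
r_strata = {}
for level in (False, True):
    xs = [xi for xi, zi in zip(x, z) if zi == level]
    ys = [yi for yi, zi in zip(y, z) if zi == level]
    r_strata[level] = pearson_r(xs, ys)

print(f"pooled r = {r_pooled:.2f}, within strata: {r_strata}")
```

Stratifying only helps for confounders you have measured, which is why randomization remains the stronger tool.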

A study: moderate drinkers live longer than non-drinkers. Which confounder is most likely?

Spurious correlations

**A spurious correlation** is a statistical association with neither a causal link nor a shared confounder. It arises from chance, especially under multiple comparisons, or from shared trends (population growth, technology, and so on).

**The multiple comparisons problem:** if you test 100 hypotheses at a significance threshold of p < 0.05 and none of the effects is real, on average 5 will still appear 'significant' purely by chance. The site tylervigen.com collects absurd correlations: divorces in Maine correlate with margarine consumption (r = 0.99).
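The "5 out of 100" figure can be checked with a rough simulation. The setup below is an assumption made for illustration: every test is a two-sided z-test on pure noise (so every null hypothesis is true), and the batch of 100 tests is repeated 200 times to get a stable average.

```python
import math
import random

random.seed(42)
ALPHA = 0.05
N_TESTS = 100      # hypotheses per batch, as in the text
N_BATCHES = 200    # repeat the whole exercise to get a stable average
SAMPLE_SIZE = 20   # observations per test (arbitrary choice)

def p_value_for_pure_noise():
    """Two-sided z-test of 'mean = 0' on data whose true mean really is 0."""
    sample = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    z = (sum(sample) / SAMPLE_SIZE) * math.sqrt(SAMPLE_SIZE)  # known sigma = 1
    return math.erfc(abs(z) / math.sqrt(2))                   # equals 2 * (1 - Phi(|z|))

# Count 'significant' results in each batch of 100 tests on noise.
false_positives = [
    sum(p_value_for_pure_noise() < ALPHA for _ in range(N_TESTS))
    for _ in range(N_BATCHES)
]
avg_false_positives = sum(false_positives) / N_BATCHES

# Chance that at least one of 100 independent true-null tests looks significant.
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

print(f"average false positives per 100 tests: {avg_false_positives:.2f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
```

The average lands near 5, and the probability of at least one false positive in a batch of 100 is about 0.994, so a researcher who reports only the 'hits' will almost always have something to report.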

**How to recognize a spurious correlation:** 1) No plausible mechanism for the link; 2) Found by 'data dredging'; 3) Both variables are time series with a common trend; 4) The result fails to replicate on other data; 5) Researchers tested many hypotheses but only reported the 'successful' ones.

**Myth:** If a correlation is very strong (r > 0.9), the link is causal.

**Reality:** Correlation strength does not imply causation. Spurious correlations can be arbitrarily strong.

Two time series with a shared trend (both rising or falling) yield correlations close to 1 even when no link exists. The criterion for causation is mechanism plus experiment, not correlation strength.
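The shared-trend effect is easy to reproduce. The sketch below builds two hypothetical series that both rise over time but have independent noise. It then compares the correlation of the raw levels with the correlation of the period-to-period changes. Differencing is one common check for trend-driven correlation; it is used here as an illustration, not as a method prescribed by the lesson.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)

random.seed(7)
T = 500
# Two unrelated series: each is an upward trend plus its own independent noise.
series_a = [1.0 * t + random.gauss(0, 5) for t in range(T)]
series_b = [2.0 * t + random.gauss(0, 5) for t in range(T)]

# Correlation of the raw levels is dominated by the common trend.
r_levels = pearson_r(series_a, series_b)

# First differences remove the trend and leave only the independent noise.
diff_a = [series_a[t] - series_a[t - 1] for t in range(1, T)]
diff_b = [series_b[t] - series_b[t - 1] for t in range(1, T)]
r_diffs = pearson_r(diff_a, diff_b)

print(f"levels: r = {r_levels:.3f}, differences: r = {r_diffs:.3f}")
```

The raw series correlate almost perfectly, while their changes are essentially uncorrelated, which is the pattern to expect when a trend, not a link, produces the correlation.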

A journalist found organic food sales correlating with autism diagnoses (r = 0.95). Which explanation is most likely?

Key Ideas

**Correlation is not causation:** a statistical link does not mean one thing causes another.
**Four explanations of a correlation:** A→B, B→A, C→(A,B), or coincidence.
**Confounders:** hidden variables affecting both measured ones.
**Spurious correlations:** random alignments, especially in time series.
**Gold standard:** a randomized experiment is the strongest evidence of causation.

Questions for Reflection

Recall a news story of the form 'X is linked to Y'. Which confounders could have produced the link?
When did you last draw a causal conclusion from an observation? Was it justified?
How would you test causation when an experiment is impossible (for instance, smoking causing cancer)?

Related Lessons

stat-08-correlation