Statistics
Confounders and Simpson's Paradox
Lesson objectives
- Identify a confounder from a DAG and distinguish it from a mediator or collider
- Recognize Simpson's paradox in production metrics
- Apply back-door adjustment to estimate a causal effect
- Build a minimal adjustment set from a causal graph
- Explain why randomization eliminates confounding
Prerequisites
- Basic causality and the do-operator
- Conditional distributions and conditional independence
- Linear regression
The same 1973 Berkeley dataset appeared to both prove and refute gender discrimination in graduate admissions. The discrimination claim collapsed not because of a new survey but because statisticians split the data by department. The case entered textbooks as the canonical confounding example and remains a standard motivation for formal causal inference.
- **Instagram and TikTok recommenders**: models train on biased engagement data, and Simpson's paradox shows up at scale - aggregate metrics rise while satisfaction drops
- **Clinical trials**: older patients receive more treatments and also die more often, so a naive analysis makes treatments look lethal
- **Google search quality**: aggregate CTR rises while per-query CTR falls because the query mix shifts
- **RLHF in LLMs**: reward models trained on preference scores without controlling for prompt difficulty learn to confuse verbosity with quality
A paradox with three authors and one name
E.H. Simpson published 'The Interpretation of Interaction in Contingency Tables' in 1951, describing the paradox that now carries his name. Simpson himself noted the result was 'not new': the same phenomenon was discussed by Udny Yule in 1903 and by Karl Pearson around the same decade. The modern mechanistic framework, Directed Acyclic Graphs, was formalized only in the 1980s and 1990s by Judea Pearl, who won the Turing Award for this work in 2011.
What a confounder is
1973, UC Berkeley. The graduate school admitted 44% of male applicants and only 35% of female applicants. On its face that looked like discrimination, and the university faced the prospect of a lawsuit. Then statisticians broke the numbers down by department, and the direction reversed: in most of the large departments, women were admitted at equal or higher rates than men. The very same dataset seemed to show both discrimination and its absence. Both views are arithmetically correct, and that contradiction is the heart of Simpson's Paradox.
A confounding variable is a third variable Z that influences both the supposed cause X and the outcome Y. Because of Z, the observed X-Y correlation no longer reflects the true causal effect. In the Berkeley case the confounder was choice of department: women applied disproportionately to humanities programs with low admission rates, men to engineering programs with high admission rates.
In DAG notation the structure is X <- Z -> Y, alongside the arrow X -> Y under study: Z causally drives both X and Y. Measuring the X-Y relationship without conditioning on Z mixes in a spurious correlation that has nothing to do with any causal path from X to Y.
A confounder is not just any third variable. It must sit on a back-door path from X to Y. Mediators (on the causal pipeline) and colliders (common children) require completely different handling.
ML angle: hospital-trained models
Selection bias and confounding combine to break a production model.
A mortality prediction model is trained on data from one large hospital. Test AUC is 0.92, but accuracy collapses in production. The reason: hospitals admit predominantly severe cases, while healthier patients are treated as outpatients. Severity influences both the recorded symptoms and the outcome, and it also decides who enters the training set. The model learned a distribution in which severe symptoms almost always lead to bad outcomes, and it never saw mild patients at all.
Confounding is the main technical reason correlation differs from causation. Without an explicit model of Z, any regression of Y on X measures a mixture of the causal effect and confounding bias.
A credit scoring model is trained only on historical loans the bank approved. Performance degrades on fresh applications. What is the structural issue?
Approval is a selection mechanism sitting between application features and repayment. The model only sees applicants the bank deemed reliable, so it never learns what happens to rejected applicants. This is a structural problem, not overfitting.
Simpson's Paradox: the mechanism
Simpson's Paradox is a formal effect: the direction of a statistical association within each stratum of a confounder can be opposite to the direction of the aggregate association. Not a visual illusion, not a computational mistake - a clean arithmetic artifact of weighted averaging.
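The weighted-averaging mechanism can be written out explicitly. As a sketch, let Y be a binary outcome, G the group being compared, and D the stratum (G and D are notation introduced here for illustration):

$$
P(Y = 1 \mid G = g) = \sum_{d} P(Y = 1 \mid G = g, D = d)\, P(D = d \mid G = g)
$$

The within-stratum rates $P(Y = 1 \mid G = g, D = d)$ can favour one group in every stratum while the weights $P(D = d \mid G = g)$ differ enough between groups to reverse the aggregate comparison.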
| Department | Male apps | Male admit % | Female apps | Female admit % |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |
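| Total (A-F) | 2691 | 45% | 1835 | 30% |
In four of the six departments (A, B, D, F) women are admitted at higher rates than men. Pooled over these six departments, however, they trail by roughly 14 percentage points (about 45% versus 30%); across the whole graduate school the gap was 9 points (44% versus 35%). The cause: about 93% of women's applications in this table went to departments C-F, which have low admission rates, compared with about 49% of men's, while men applied heavily to A and B with high admission rates.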
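The reversal can be reproduced directly from the table. The sketch below is illustrative only: it pools the six listed departments (so its totals need not match school-wide figures), and because it starts from rounded percentages the reconstructed rates are approximate. The names `DEPARTMENTS`, `aggregate_rate`, and `departments_favouring_women` are made up for this example.

```python
# (applicants, admission rate) per department, copied from the table above.
# Rates are rounded percentages, so pooled results are approximate.
DEPARTMENTS = {
    "A": {"men": (825, 0.62), "women": (108, 0.82)},
    "B": {"men": (560, 0.63), "women": (25, 0.68)},
    "C": {"men": (325, 0.37), "women": (593, 0.34)},
    "D": {"men": (417, 0.33), "women": (375, 0.35)},
    "E": {"men": (191, 0.28), "women": (393, 0.24)},
    "F": {"men": (373, 0.06), "women": (341, 0.07)},
}


def aggregate_rate(group):
    """Pooled admission rate: per-department rates weighted by where
    this group actually applied."""
    applicants = sum(dept[group][0] for dept in DEPARTMENTS.values())
    admitted = sum(dept[group][0] * dept[group][1] for dept in DEPARTMENTS.values())
    return admitted / applicants


def departments_favouring_women():
    """Departments where the women's admission rate exceeds the men's."""
    return [name for name, dept in DEPARTMENTS.items()
            if dept["women"][1] > dept["men"][1]]


print("women ahead in:", departments_favouring_women())
print(f"pooled men:   {aggregate_rate('men'):.1%}")
print(f"pooled women: {aggregate_rate('women'):.1%}")
```

Women lead in four departments, yet the pooled rate for men comes out well above the pooled rate for women, because the two groups weight the departments very differently.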
Kidney stones: treatment A vs B
The canonical medical example, from Charig et al. (1986).
Treatment A succeeds in 78% of cases overall, B in 83%, so B looks better. But for small stones A succeeds in 93% of cases and B in 87%; for large stones A succeeds in 73% and B in 69%. A wins inside both subgroups. Stone size is the confounder: doctors assigned A to the harder cases with large stones, dragging A's overall rate down. Once stone size is held fixed, treatment A is the better option.
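The percentages above can be checked against patient counts. The sketch below uses the commonly cited counts for the 1986 study; these counts are not given in the text above and are included here as an assumption. `OUTCOMES`, `success_rate`, and `large_stone_share` are illustrative names.

```python
# treatment -> stone size -> (successes, patients).
# Counts are the commonly cited figures for Charig et al. (1986);
# they are an assumption of this sketch, not taken from the lesson text.
OUTCOMES = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(treatment, size=None):
    """Success rate for one treatment, overall or within one stone size."""
    cells = OUTCOMES[treatment]
    chosen = list(cells.values()) if size is None else [cells[size]]
    successes = sum(s for s, _ in chosen)
    patients = sum(n for _, n in chosen)
    return successes / patients


def large_stone_share(treatment):
    """Fraction of a treatment's patients who had large stones."""
    cells = OUTCOMES[treatment]
    total = sum(n for _, n in cells.values())
    return cells["large"][1] / total


for size in ("small", "large", None):
    label = size or "overall"
    print(f"{label:>7}: A {success_rate('A', size):.0%}  B {success_rate('B', size):.0%}")
print(f"large-stone share: A {large_stone_share('A'):.0%}  B {large_stone_share('B'):.0%}")
```

The last line makes the mechanism visible: most of A's patients had large stones, while most of B's had small ones.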
In production ML monitoring Simpson's paradox is especially treacherous. Overall accuracy can rise release after release while accuracy in every demographic stratum falls. Shifting traffic mix is enough to fool the team.
When comparing two policies or models, watch for very different population mixes on a key feature. If the mix is observational rather than experimentally fixed, Simpson is on the table.
Small-sample noise can also flip an aggregate effect. Simpson's paradox specifically requires structural distortion through a confounder, not random fluctuation.
An A/B test of a new ranker shows global CTR up 2%, yet CTR drops inside every individual country. What happened?
Country is a confounder. If the treatment arm received more users from naturally high-CTR markets, which itself signals that assignment was not balanced across countries, the global metric improves even though the ranker hurts clicks everywhere.
Controlling for confounders
When confounders are observable and measured, the causal effect can be recovered. Three core techniques: stratification (analyze inside strata), matching (pair treatment and control with similar Z), regression adjustment (include Z as covariates).
Pearl's back-door adjustment formula:

$$
P(Y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)
$$

The left side is what would happen if X were forcibly set to x (intervention via the do-operator). The right side is computable from observed data, provided Z blocks every back-door path from X to Y.
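Back-door adjustment, averaging P(Y | X, Z) over the marginal P(Z) instead of P(Z | X), can be sketched on the kidney-stone example. This assumes stone size is the only confounder, which is a modelling assumption and not something the data can confirm. The patient counts are the commonly cited figures for the 1986 study and are not given in the lesson text; all function names are illustrative.

```python
# treatment -> stone size -> (successes, patients); assumed counts, see lead-in.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def marginal_z():
    """P(Z = z): share of each stone size among all patients, ignoring treatment."""
    totals = {}
    for cells in COUNTS.values():
        for size, (_, patients) in cells.items():
            totals[size] = totals.get(size, 0) + patients
    grand_total = sum(totals.values())
    return {size: n / grand_total for size, n in totals.items()}


def naive_rate(treatment):
    """P(Y = 1 | X = treatment): implicitly weights strata by P(Z | X)."""
    cells = COUNTS[treatment].values()
    return sum(s for s, _ in cells) / sum(n for _, n in cells)


def adjusted_rate(treatment):
    """Back-door estimate: sum over z of P(Y = 1 | X, Z = z) * P(Z = z)."""
    p_z = marginal_z()
    return sum((s / n) * p_z[size] for size, (s, n) in COUNTS[treatment].items())


for t in ("A", "B"):
    print(f"{t}: naive {naive_rate(t):.3f}  adjusted {adjusted_rate(t):.3f}")
```

Under that assumption the adjusted comparison favours A, in line with the within-stratum results, while the naive comparison favours B.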
Randomization (RCT) makes X independent of every Z by construction, so P(Z|X) = P(Z) and the naive difference of means equals the true causal effect. That is the structural reason A/B tests are the gold standard for causal inference.
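A small simulation can illustrate the claim. Everything here is assumed for the sketch: the true effect of 1.0, the confounder strength of 2.0, the uptake probabilities, and the function name `naive_difference`. The only thing that changes between the two runs is how X is assigned.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of X on Y in this toy model


def naive_difference(randomized, n=20_000, seed=0):
    """Difference of mean outcomes, treated minus control, with no adjustment."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # confounder
        if randomized:
            x = rng.random() < 0.5  # coin flip that ignores Z
        else:
            x = rng.random() < (0.8 if z > 0 else 0.2)  # Z drives uptake
        y = TRUE_EFFECT * x + 2.0 * z + rng.gauss(0, 1)
        (treated if x else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


observational = naive_difference(randomized=False)
experimental = naive_difference(randomized=True)
print(f"true effect:          {TRUE_EFFECT:.2f}")
print(f"observational naive:  {observational:.2f}")
print(f"randomized naive:     {experimental:.2f}")
```

With self-selected treatment the naive difference is inflated by the Z-driven imbalance between arms; with a coin flip the same estimator lands close to the assumed true effect.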
If treatment assignment is not strictly random (for example, users self-select into a beta feature), conditioning on observed covariates will not protect the analysis from unobserved confounders. Instrumental variables (IV), regression discontinuity designs (RDD), or difference-in-differences (DiD) become necessary.
Regression adjustment of Y on X plus Z produces an unbiased effect estimate only under linearity and correct specification. In practice combine with propensity scores or use doubly robust methods.
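Regression adjustment can be sketched without any libraries by using the Frisch-Waugh-Lovell result: the coefficient on X in a regression of Y on X and Z equals the slope obtained after removing Z's linear contribution from both X and Y. The data-generating process below is linear by construction, so it is the favourable case described above; the coefficients and helper names are assumptions of this example.

```python
import random


def slope(xs, ys):
    """OLS slope of ys on xs, with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def residuals(xs, ys):
    """What is left of ys after a linear fit on xs."""
    b = slope(xs, ys)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [(y - my) - b * (x - mx) for x, y in zip(xs, ys)]


def adjusted_slope(x, y, z):
    """Coefficient on x in the regression of y on x and z (Frisch-Waugh-Lovell)."""
    return slope(residuals(z, x), residuals(z, y))


# Toy linear world: Z confounds X and Y; the assumed true effect of X on Y is 1.0.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5_000)]
x = [1.5 * zi + rng.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

naive = slope(x, y)
adjusted = adjusted_slope(x, y, z)
print(f"naive slope of Y on X:    {naive:.2f}")
print(f"slope after adjusting Z:  {adjusted:.2f}")
```

If the true relationship were nonlinear in Z, the same procedure would leave residual bias, which is why the text recommends pairing it with propensity scores or doubly robust estimators.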
Which condition must hold for back-door adjustment with a set Z to deliver a valid causal effect?
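The condition is Pearl's formal back-door criterion: Z must block every back-door path from X to Y, and no member of Z may be a descendant of X. Including a descendant of X in Z can block part of the causal effect (if it is a mediator) or open a spurious path (if it is a collider), which breaks the estimate even when correlations look reasonable.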
DAGs: formalizing causal structure
A Directed Acyclic Graph is the language modern causal inference uses to record assumptions about dependence. Nodes are variables, arrows are direct causal links, no cycles allowed. A DAG immediately tells the analyst what to condition on and what to leave alone.
A back-door path from X to Y is any path, traced without regard to arrow direction, whose first edge points into X (that is, X <- ...). If such a path exists and is not blocked by conditioning, it injects bias.
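The definition translates into a short path search. The sketch below is a minimal, unoptimized checker for small hand-drawn graphs, not a substitute for a dedicated tool. A DAG is represented as a set of `(parent, child)` pairs; that representation and all function names are choices made for this example.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following arrows forward."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def backdoor_paths(edges, x, y):
    """Simple paths from x to y (direction ignored) whose first edge points into x."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths = []

    def walk(path):
        if path[-1] == y:
            paths.append(path)
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                walk(path + [nxt])

    for parent in [p for p, c in edges if c == x]:
        walk([x, parent])
    return paths


def is_blocked(path, edges, z):
    """A path is blocked if some inner node stops the flow given the set z."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # A collider blocks unless it or one of its descendants is conditioned on.
            if node not in z and not (descendants(edges, node) & z):
                return True
        elif node in z:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def satisfies_backdoor(edges, x, y, z):
    """Back-door criterion: no descendant of x in z, every back-door path blocked."""
    z = set(z)
    if z & descendants(edges, x):
        return False
    return all(is_blocked(p, edges, z) for p in backdoor_paths(edges, x, y))


confounded = {("Z", "X"), ("Z", "Y"), ("X", "Y")}
print(backdoor_paths(confounded, "X", "Y"))             # one path through Z
print(satisfies_backdoor(confounded, "X", "Y", set()))   # back-door path left open
print(satisfies_backdoor(confounded, "X", "Y", {"Z"}))   # blocked by conditioning on Z
```

For anything beyond toy graphs, a maintained library is the safer choice, since this search enumerates every simple path and scales poorly.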
| Structure | Action | Cost of the wrong call |
|---|---|---|
| Confounder: X <- Z -> Y | Condition on Z | Without Z: spurious correlation |
| Mediator: X -> M -> Y | Do not condition on M | With M: real effect disappears |
| Collider: X -> C <- Y | Do not condition on C | With C: a fake link appears |
| Proxy of a confounder: X <- Z -> Y, with Z -> W | Condition on W if Z is hidden | Only partial correction: residual bias remains |
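The collider row is the least intuitive, so a simulation may help. In this sketch X and Y are generated independently and C is their common effect; the noise level, the selection threshold of 1.0, and the helper name `correlation` are assumptions made for illustration. Selecting on C is one way of conditioning on it.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
n = 10_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]                    # independent of x
c = [xi + yi + rng.gauss(0, 0.5) for xi, yi in zip(x, y)]  # collider: X -> C <- Y

# Conditioning on the collider by keeping only rows with a high value of C.
kept = [(xi, yi) for xi, yi, ci in zip(x, y, c) if ci > 1.0]
kept_x = [xi for xi, _ in kept]
kept_y = [yi for _, yi in kept]

corr_all = correlation(x, y)
corr_selected = correlation(kept_x, kept_y)
print(f"all rows:        corr(X, Y) = {corr_all:+.2f}")
print(f"rows with C > 1: corr(X, Y) = {corr_selected:+.2f}")
```

In the full sample the correlation is near zero, as constructed; within the selected rows a clear negative correlation appears even though X and Y never influence each other.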
ML angle: hospital readmission
Feature selection without a causal graph yields fragile models.
A model predicts 30-day hospital readmission. One feature is 'discharge against medical advice' (DAMA). DAMA is a collider: patient condition affects it, patient personality affects it, and both affect readmission. Conditioning on DAMA induces a spurious link between condition and personality. Production performance collapses the moment the hospital changes its discharge policy.
Multiple Z sets can satisfy the back-door criterion. A minimal set keeps the model small and avoids unnecessary conditioning, though it is not guaranteed to give the lowest estimator variance. Tools such as dagitty and DoWhy can find valid sets automatically.
The graph is drawn by a human. A missed arrow or an unmodeled hidden confounder leaves every downstream effect biased. DAGs do not replace domain knowledge - they make it explicit.
In a DAG: education (E) -> income (I), health (H) -> income (I), and education (E) -> health (H). The target is the effect of E on I. What belongs in the adjustment set?
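The adjustment set is empty: E has no parents in this graph, so no back-door path exists. H sits on the causal path E -> H -> I, so it is a mediator, not a confounder. Conditioning on H blocks the indirect part of the effect, leaving only the direct E -> I component instead of the total effect.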
Where this leads
Confounders are the central problem of causal inference. Every later method is a different way to defeat them.
- Randomized Controlled Trials: randomization is the gold-standard antidote to confounding because it cuts every arrow from a confounder into the treatment, so no back-door path stays open
- Propensity Score Methods: propensity score matching and inverse probability weighting (IPW) adjust for measured confounders when randomization is impossible
Key takeaways
- A confounder Z simultaneously drives the cause X and the outcome Y, producing spurious correlation
- Simpson's paradox is an arithmetic reversal of effect direction caused by aggregating across a confounder
- Back-door adjustment: average the conditional effect over marginal P(Z), not over P(Z|X)
- A DAG formalizes causal structure and tells the analyst what to control for
- Colliders are the opposite hazard: conditioning on one induces fake correlation
- Randomization makes X independent of every Z by construction - the radical antidote to confounding
Questions for reflection
- Which confounders may exist in the team's current production dataset, and how can they be surfaced?
- Could Simpson's paradox be lurking in the headline metric used to ship products?
- Among current model features, which are mediators and which are colliders?
- What needs to change in data collection to make the dominant confounder neutralizable?
Related lessons
- stat-20-causal — basic causal vocabulary and DAGs
- stat-40-causal-rct — RCTs neutralize confounders by design
- stat-42-causal-propensity — propensity score balances observed confounders
- stat-30-stats-ml — selection bias in ML training data
- prob-04-bayes