Causal Calculus
do(X) Operator: Intervention vs Observation
Every neural network trains on $P(Y|X)$. The world runs on $\text{do}(X)$. A model trained on $P(\text{hospitalization}|\text{smoking})$ will not, in general, correctly predict $P(\text{hospitalization}|\text{do}(\text{quit smoking}))$ - not even with infinite data. This is not a data problem. It is a language problem. Judea Pearl spent 20 years explaining the difference mathematically.
- **RCT vs observational:** clinical trials physically realize do(treatment) through randomization - cost USD 100-800M per drug. Causal inference on observational data is an attempt to get the same answer from observations, when the DAG allows it
- **IRM (Arjovsky 2019):** Invariant Risk Minimization seeks features with an invariant predictor across environments - ML's attempt to move from Rung 1 to Rung 2 of Pearl's ladder without an explicit DAG
- **XAI and counterfactuals:** 'what to change in X so that Y becomes y' is a direct application of the do-operator. LIME and SHAP work at Rung 1; true counterfactual explanations require Rung 3
Prerequisites
- DAG and d-separation: reading information flow in a causal graph
- Backdoor criterion: when conditioning on Z gives an unbiased estimate
Observation vs Intervention
1995. Judea Pearl proves a statement that sounds philosophical but is rigorous mathematics: **correlation never becomes causation - not even with infinite data.** The only way out is a new operator in the language of probability.
Ordinary conditional probability $P(Y|X=x)$ is observation: select from the existing population those with $X = x$, then look at $Y$. The operator $\text{do}(X=x)$ is intervention: **surgically** set $X = x$ for the entire population, regardless of the natural causes of $X$.
In ML terms: every neural network trained on $P(Y|X)$ finds correlations. If smokers in the training data develop cancer more often, the model learns that association. But suppose cancer is in fact caused by genetics, which also drives smoking. The model will still predict that quitting smoking reduces cancer risk, because the observational data support it, even though intervening on smoking would change nothing. The interventional question requires a different tool.
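To make the gap concrete, here is a minimal sketch that enumerates a hypothetical model exactly, with no sampling. It assumes genetics drives both smoking and cancer while smoking itself has no effect. All probabilities and names (`P_G`, `P_S_GIVEN_G`, `P_C_GIVEN_G`) are illustrative assumptions, not estimates from real data.

```python
from itertools import product

# Hypothetical structural model (all numbers are illustrative assumptions):
#   G (genetic risk) -> S (smoking), G -> C (cancer); S has NO effect on C.
P_G = {0: 0.7, 1: 0.3}            # P(G = g)
P_S_GIVEN_G = {0: 0.2, 1: 0.8}    # P(S = 1 | G = g)
P_C_GIVEN_G = {0: 0.05, 1: 0.40}  # P(C = 1 | G = g), independent of S


def joint(g, s, c):
    """P(G=g, S=s, C=c) under the observational (unmanipulated) model."""
    p_s = P_S_GIVEN_G[g] if s == 1 else 1 - P_S_GIVEN_G[g]
    p_c = P_C_GIVEN_G[g] if c == 1 else 1 - P_C_GIVEN_G[g]
    return P_G[g] * p_s * p_c


def p_cancer_given_smoking(s):
    """Rung 1: P(C=1 | S=s), obtained by filtering the population on S."""
    num = sum(joint(g, s, 1) for g in (0, 1))
    den = sum(joint(g, s, c) for g, c in product((0, 1), repeat=2))
    return num / den


def p_cancer_do_smoking(s):
    """Rung 2: P(C=1 | do(S=s)).

    S is set by fiat, so G keeps its prior distribution. In this model
    C does not depend on S, which is why `s` does not appear below.
    """
    return sum(P_G[g] * P_C_GIVEN_G[g] for g in (0, 1))


observed_gap = p_cancer_given_smoking(1) - p_cancer_given_smoking(0)
causal_gap = p_cancer_do_smoking(1) - p_cancer_do_smoking(0)
```

In this toy model `observed_gap` is clearly positive while `causal_gap` is exactly zero: conditioning on smoking shifts the mix of genotypes in the selected group, whereas setting smoking by intervention does not.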
**Pearl's causal ladder (three rungs):**
- **Rung 1 (Association):** $P(Y|X)$ - correlation, observation. Every ML algorithm lives here.
- **Rung 2 (Intervention):** $P(Y|\text{do}(X))$ - intervention, causal effect. Requires a causal graph.
- **Rung 3 (Counterfactual):** $P(Y_{x'}|X=x, Y=y)$ - 'what would have been'. Requires structural equations.
ML in 2024 operates mostly at Rung 1. IRM, DML, and causal discovery attempt to reach Rung 2.
In a dataset, hospitalized vaccinated patients die from COVID less often than hospitalized unvaccinated patients. Does this mean P(death|vaccine=1) < P(death|vaccine=0) captures the causal effect of vaccination?
Mutilated graph: surgery on a DAG
How to compute $P(Y|\text{do}(X=x))$ mathematically? Pearl gives an elegant answer via **graph surgery** (mutilation). The intervention $\text{do}(X=x)$ is equivalent to removing all incoming edges to $X$ and setting $X = x$.
Removing edges into X eliminates all causes that would naturally influence X in the real world. The only cause of X is now the intervention. This is exactly what randomized controlled trials do physically: randomization cuts the connection between patient characteristics and treatment.
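In formula form, the surgery is usually written as a truncated factorization. The notation here is introduced only for illustration: $V_1, \ldots, V_n$ are the variables of the DAG and $\text{pa}_i$ are the values of the parents of $V_i$. The observational joint factorizes over the graph, and the intervention deletes exactly the factor that belongs to $X$:

$$P(v_1, \ldots, v_n) = \prod_{i} P(v_i|\text{pa}_i)$$

$$P(v_1, \ldots, v_n|\text{do}(X=x)) = \prod_{i:\, V_i \neq X} P(v_i|\text{pa}_i) \quad \text{for } v \text{ consistent with } X = x, \text{ and } 0 \text{ otherwise}$$

Every other mechanism is left unchanged; only the mechanism that used to generate $X$ is replaced by the constant $x$.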
**RCT vs observational study through the lens of do-operator:**
- RCT: $P(\text{outcome}|\text{do}(\text{treatment}=1))$ - physically realized mutilation. Randomization = cutting all incoming edges.
- Observational study: $P(\text{outcome}|\text{treatment}=1)$ - biased estimate when confounders are present.
- Cost of the cut: RCT in pharmacology costs USD 100-800M. Causal inference on observational data costs computation - if the graph is identifiable.
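Graph surgery itself is a small operation on the parent sets of a DAG. The sketch below assumes a DAG stored as a dictionary that maps each node to the set of its direct causes. The node names and the helpers `mutilate` and `edges` are hypothetical, chosen for this illustration.

```python
def mutilate(parents, target):
    """Return a copy of the DAG with every edge into `target` removed.

    `parents` maps each node to the set of its direct causes.
    """
    cut = {node: set(pa) for node, pa in parents.items()}
    cut[target] = set()  # do(target): the intervention is now the only cause
    return cut


def edges(parents):
    """List the DAG as sorted (cause, effect) pairs."""
    return sorted((pa, node) for node, pas in parents.items() for pa in pas)


# Hypothetical example: severity confounds treatment and outcome.
dag = {
    "severity": set(),
    "treatment": {"severity"},
    "outcome": {"severity", "treatment"},
}
rct = mutilate(dag, "treatment")

print(edges(dag))  # observational world: severity -> treatment is present
print(edges(rct))  # mutilated world: only edges into outcome remain
```

The edge from `severity` to `treatment` disappears, while both edges into `outcome` survive. This is the graph an ideal randomized trial samples from.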
What happens to the DAG edges when computing P(Y|do(X=x))?
Identifiability and IRM in ML
Graph mutilation is conceptually elegant - but computing $P(Y|\text{do}(X))$ requires data from the mutilated world (RCT). In practice one often has only observational data. The key question: **when is $P(Y|\text{do}(X))$ identifiable** - expressible through $P(Y, X, Z, \ldots)$ from observations?
The backdoor criterion gave the first answer: if $Z$ blocks all backdoor paths from $X$ to $Y$ and contains no descendant of $X$, then:
$$P(Y=y|\text{do}(X=x)) = \sum_z P(Y=y|X=x, Z=z) P(Z=z)$$
This is the adjustment formula - the 'controlling for confounders' familiar from medical statistics, now with a precise condition for applicability.
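The adjustment formula can be checked on a small example. The sketch below builds a hypothetical observational joint over binary $Z$, $X$, $Y$ with $Z \to X$, $Z \to Y$ and $X \to Y$, then estimates the effect of $X$ using only that joint table. All numbers and helper names (`prob`, `naive`, `backdoor`) are assumptions made for illustration.

```python
from itertools import product

# Illustrative mechanisms used only to generate the observational table.
P_Z = {0: 0.6, 1: 0.4}                     # P(Z = z)
P_X1_GIVEN_Z = {0: 0.25, 1: 0.75}          # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,  # P(Y = 1 | X = x, Z = z), key (x, z)
                 (0, 1): 0.5, (1, 1): 0.7}

# Observational joint P(Z=z, X=x, Y=y); estimation below reads only this table.
joint = {}
for z, x, y in product((0, 1), repeat=3):
    p_x = P_X1_GIVEN_Z[z] if x else 1 - P_X1_GIVEN_Z[z]
    p_y = P_Y1_GIVEN_XZ[(x, z)] if y else 1 - P_Y1_GIVEN_XZ[(x, z)]
    joint[(z, x, y)] = P_Z[z] * p_x * p_y


def prob(**fixed):
    """Probability of the event described by keyword arguments z, x, y."""
    return sum(p for (z, x, y), p in joint.items()
               if all({"z": z, "x": x, "y": y}[k] == v for k, v in fixed.items()))


def naive(x):
    """P(Y=1 | X=x): plain conditioning, confounded by Z."""
    return prob(x=x, y=1) / prob(x=x)


def backdoor(x):
    """sum_z P(Y=1 | X=x, Z=z) P(Z=z): the adjustment formula."""
    return sum(prob(z=z, x=x, y=1) / prob(z=z, x=x) * prob(z=z) for z in (0, 1))


naive_effect = naive(1) - naive(0)
adjusted_effect = backdoor(1) - backdoor(0)
```

By construction, $X$ raises $P(Y=1)$ by 0.2 in every stratum of $Z$. `adjusted_effect` recovers that value, while `naive_effect` is roughly twice as large because $Z$ pushes both $X$ and $Y$ upward.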
**ML and causality: IRM (Arjovsky et al. 2019).** Standard ERM finds $\arg\min_h \mathbb{E}[L(h(X), Y)]$ - minimizes risk over the entire dataset. If data contain spurious correlations (e.g., background color correlates with class), ERM learns them. IRM seeks features that give an invariant predictor **across environments** - an attempt to find causal features stable under different $\text{do}$-interventions.
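The invariance idea can be illustrated with a gradient penalty in the spirit of the practical IRMv1 objective: for a fixed scalar feature, take the squared gradient of each environment's risk with respect to a dummy scale $w$ at $w = 1$. This is a simplified sketch, not the full training procedure. The data-generating process ($x_c \to y \to x_s$), the noise levels, and the function names are assumptions chosen for this example.

```python
import random


def make_env(n, sigma_spurious, rng):
    """Hypothetical environment: x_c -> y -> x_s, with environment-specific noise on x_s."""
    rows = []
    for _ in range(n):
        x_c = rng.gauss(0, 1)
        y = x_c + rng.gauss(0, 0.5)             # invariant mechanism
        x_s = y + rng.gauss(0, sigma_spurious)  # changes across environments
        rows.append((x_c, x_s, y))
    return rows


def irm_penalty(features, targets):
    """Squared gradient of the risk mean((w*phi - y)^2) w.r.t. scalar w at w = 1."""
    grad = sum(2 * (phi - y) * phi for phi, y in zip(features, targets)) / len(targets)
    return grad ** 2


def total_penalty(envs, column):
    """Sum the penalty over environments for the feature stored in `column`."""
    return sum(irm_penalty([r[column] for r in env], [r[2] for r in env])
               for env in envs)


rng = random.Random(0)
envs = [make_env(5000, 0.1, rng), make_env(5000, 1.0, rng)]
penalty_causal = total_penalty(envs, 0)    # feature x_c
penalty_spurious = total_penalty(envs, 1)  # feature x_s
```

For the causal feature, $w = 1$ is close to optimal in both environments, so its penalty stays near zero. For the spurious feature the best scale differs between environments, and the penalty is large. A full IRM setup would learn the feature map rather than compare two fixed candidates.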
IRM is not a solution to the causality problem in ML, but the first systematic step. Counterfactual explanations in XAI ('what to change in $X$ to get a different $Y$') are another practical application of the do-operator in industry.
Misconception: a large enough neural network will learn causal relationships from data
Reality: without causal structure any ML algorithm is confined to Rung 1 of Pearl's ladder - statistical association
Pearl's framework makes this precise: there exist pairs of causal models that induce the same observational distribution $P(X, Y, Z)$ yet different interventional distributions $P(Y|\text{do}(X))$. Consequently, no algorithm operating only on $P(X, Y, Z)$ can recover the do-distribution without additional assumptions about structure (a DAG). This is not a computational limitation - it is an information barrier.
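A two-variable example shows the barrier directly. The sketch below defines two hypothetical models: in model A, $X$ causes $Y$; in model B, a hidden variable $U$ drives both and $X$ has no effect. The numbers and function names are illustrative assumptions.

```python
from itertools import product


# Model A (hypothetical): X -> Y, no confounder.
def joint_a(x, y):
    """P(X=x, Y=y) in model A."""
    p_y1 = 0.8 if x == 1 else 0.2
    return 0.5 * (p_y1 if y == 1 else 1 - p_y1)


def do_a(x):
    """P(Y=1 | do(X=x)) in model A: X really moves Y."""
    return 0.8 if x == 1 else 0.2


# Model B (hypothetical): hidden U -> X and U -> Y, no edge X -> Y.
def joint_b(x, y):
    """P(X=x, Y=y) in model B, with the unobserved U summed out."""
    total = 0.0
    for u in (0, 1):
        p_x = 1.0 if x == u else 0.0  # X copies U
        p_y1 = 0.8 if u == 1 else 0.2
        total += 0.5 * p_x * (p_y1 if y == 1 else 1 - p_y1)
    return total


def do_b(x):
    """P(Y=1 | do(X=x)) in model B: Y ignores X, so average over U."""
    return 0.5 * 0.8 + 0.5 * 0.2


same_observations = all(
    abs(joint_a(x, y) - joint_b(x, y)) < 1e-12
    for x, y in product((0, 1), repeat=2)
)
effect_a = do_a(1) - do_a(0)
effect_b = do_b(1) - do_b(0)
```

Both models produce the same table $P(X, Y)$, so `same_observations` is true, yet the interventional effect is 0.6 in model A and 0 in model B. Any learner that sees only samples of $(X, Y)$ has no basis for choosing between them.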
A model trained on P(Y|X) achieves high accuracy. Does this guarantee correct predictions of P(Y|do(X)) under distribution shift?
Key ideas
- **$P(Y|X) \neq P(Y|\text{do}(X))$** in the presence of confounders - a fundamental distinction that does not shrink with more data
- **Graph mutilation:** $\text{do}(X=x)$ = remove all incoming edges to X, fix X=x. Causal effect is computed in the mutilated world
- **Pearl's ladder:** Rung 1 (correlation, all ML) - Rung 2 (intervention, do-operator) - Rung 3 (counterfactuals)
- **Adjustment formula:** when the backdoor criterion holds, $P(Y|\text{do}(X)) = \sum_z P(Y|X,Z=z)P(Z=z)$ - causal effect from observational data
- **ML and causality:** IRM seeks invariant features; counterfactual explanations in XAI are practical industry applications of the do-operator
What comes next
The do-operator opens the full system of Pearl's do-calculus:
- Do-calculus — Three rules for transforming do-expressions - complete axiomatization of identification
- Identifiability — When P(Y|do(X)) is computable from observations without explicit RCT
- Counterfactuals — Rung 3: P(Y_x | X=x', Y=y') - what would have happened under a different decision
- DAGs and d-separation — Foundation: reading information flows in a causal graph
Questions for reflection
- Google Ads targets users who resemble past buyers. Is this estimating P(purchase|ad shown) or P(purchase|do(ad shown))? Which one does the advertiser actually want?
- In large language models a token is selected according to P(token|context). This is Rung 1. Can an LLM reason correctly about do-operators without having causal structure?
- The backdoor adjustment requires a measurable confounder Z. What if the confounder is unobserved? How does the frontdoor criterion help in that case?
Related lessons
- cc-04-frontdoor — Frontdoor criterion: first example of computing do-expressions from observational data
- cc-06-do-calculus — Pearl's three rules systematize transformations of do-expressions
- cc-07-identifiability — Identifiability: when P(Y|do(X)) is computable from observational data
- cc-09-counterfactuals — Counterfactuals - the third rung of the causal ladder, above interventions
- lt-01-pac-intro — IRM (Invariant Risk Minimization) - ML's attempt to learn causal features rather than correlations
- stat-01-sampling