Causal Calculus

do(X) Operator: Intervention vs Observation

Every neural network trains on $P(Y|X)$. The world runs on $\text{do}(X)$. A model trained on $P(\text{hospitalization}|\text{smoking})$ will not correctly predict $P(\text{hospitalization}|\text{do}(\text{quit smoking}))$ - not even with infinite data. This is not a data problem. It is a language problem. Judea Pearl spent 20 years explaining the difference mathematically.

**RCT vs observational:** clinical trials physically realize do(treatment) through randomization - cost USD 100-800M per drug. Causal inference on observational data is an attempt to get the same answer from observations, when the DAG allows it
**IRM (Arjovsky 2019):** Invariant Risk Minimization seeks features with an invariant predictor across environments - ML's attempt to move from Rung 1 to Rung 2 of Pearl's ladder without an explicit DAG
**XAI and counterfactuals:** 'what to change in X so that Y becomes y' is a direct application of the do-operator. LIME and SHAP work at Rung 1; true counterfactual explanations require Rung 3

Prerequisites

DAG and d-separation: reading information flow in a causal graph
Backdoor criterion: when conditioning on Z gives an unbiased estimate

Frontdoor criterion

Observation vs Intervention

1995. Judea Pearl proves a statement that sounds philosophical but is rigorous mathematics: **correlation never becomes causation - not even with infinite data.** The only way out is a new operator in the language of probability.

Ordinary conditional probability $P(Y|X=x)$ is observation: select from the existing population those with $X = x$, then look at $Y$. The operator $\text{do}(X=x)$ is intervention: **surgically** set $X = x$ for the entire population, regardless of the natural causes of $X$.

In ML terms: every neural network trained on $P(Y|X)$ finds correlations. If smokers in the training data develop cancer more often, the model learns that association. But suppose cancer were caused by genetics that also drive smoking: the model would still predict lower cancer risk for non-smokers, because the observational data support it, even though making a person quit would leave their risk unchanged. The interventional question requires a different tool.
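The gap between the two questions can be made concrete with a small simulation. This is only a sketch under a hypothetical structural model (a gene that drives both smoking and cancer, with smoking having no causal effect at all); the probabilities and the `simulate` helper are assumptions chosen for illustration, not real epidemiology.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_smoking=None):
    """Hypothetical model: gene G -> smoking S, gene G -> cancer C, no S -> C edge."""
    g = rng.random(n) < 0.3                          # gene present in 30% of people
    if do_smoking is None:
        s = rng.random(n) < np.where(g, 0.8, 0.2)    # observation: the gene drives smoking
    else:
        s = np.full(n, bool(do_smoking))             # intervention: S is set by hand
    c = rng.random(n) < np.where(g, 0.4, 0.05)       # cancer depends on the gene only
    return s, c

# Rung 1: compare smokers and non-smokers as they occur naturally.
s, c = simulate()
obs_gap = c[s].mean() - c[~s].mean()                 # P(C|S=1) - P(C|S=0)

# Rung 2: force everyone to smoke, then force everyone not to.
_, c_do1 = simulate(do_smoking=1)
_, c_do0 = simulate(do_smoking=0)
do_gap = c_do1.mean() - c_do0.mean()                 # P(C|do(S=1)) - P(C|do(S=0))

print(f"observational gap: {obs_gap:.3f}, interventional gap: {do_gap:.3f}")
```

Under these assumed numbers the observational gap is large while the interventional gap is close to zero, and collecting more samples only makes both estimates sharper, not closer to each other.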

**Pearl's causal ladder (three rungs):**
- **Rung 1 (Association):** $P(Y|X)$ - correlation, observation. Every ML algorithm lives here.
- **Rung 2 (Intervention):** $P(Y|\text{do}(X))$ - intervention, causal effect. Requires a causal graph.
- **Rung 3 (Counterfactual):** $P(Y_{x'}|X=x, Y=y)$ - 'what would have been'. Requires structural equations.

ML in 2024 operates mostly at Rung 1. IRM, DML, causal discovery attempt to reach Rung 2.
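To see why Rung 3 needs structural equations, here is a minimal sketch of the standard three-step counterfactual computation (abduction, action, prediction). The linear equation $Y = 2X + U$ and the observed values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def f_y(x, u):
    """Assumed structural equation for Y (hypothetical): Y = 2*X + U."""
    return 2 * x + u

# Observed unit: X = 1, Y = 3.
x_obs, y_obs = 1.0, 3.0

# Step 1, abduction: recover the unit's noise term from what was observed.
u = y_obs - 2 * x_obs

# Step 2, action: replace X with the counterfactual value x' = 0.
x_cf = 0.0

# Step 3, prediction: push the same noise through the modified model.
y_cf = f_y(x_cf, u)

print(f"U = {u}, counterfactual Y = {y_cf}")
```

The counterfactual answer depends on the recovered noise `u`, which is specific to this unit. A graph alone, without the equation for $Y$, would not be enough to compute it.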

In a dataset, hospitalized vaccinated patients die from COVID less often than hospitalized unvaccinated patients. Does this mean P(death|vaccine=1) < P(death|vaccine=0) captures the causal effect of vaccination?

Mutilated graph: surgery on a DAG

How to compute $P(Y|\text{do}(X=x))$ mathematically? Pearl gives an elegant answer via **graph surgery** (mutilation). The intervention $\text{do}(X=x)$ is equivalent to removing all incoming edges to $X$ and setting $X = x$.

Removing edges into X eliminates all causes that would naturally influence X in the real world. The only cause of X is now the intervention. This is exactly what randomized controlled trials do physically: randomization cuts the connection between patient characteristics and treatment.
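As a sketch of the surgery itself, a DAG can be stored as a mapping from each node to its set of parents, and mutilation then amounts to emptying the parent set of the intervened node. The `mutilate` helper and the three-node graph below are illustrative assumptions, not part of any particular library.

```python
def mutilate(parents, target):
    """Return a copy of the DAG with every edge into `target` removed."""
    return {
        node: (set() if node == target else set(pa))
        for node, pa in parents.items()
    }

# Hypothetical graph with a confounder: Z -> X, Z -> Y, X -> Y.
dag = {"Z": set(), "X": {"Z"}, "Y": {"X", "Z"}}

# do(X = x): X no longer listens to Z; all other mechanisms are untouched.
dag_do_x = mutilate(dag, "X")

print(dag_do_x)
```

Only the edges pointing into `X` disappear. The edge from `X` to `Y` and the edge from `Z` to `Y` survive, which is why the effect of `X` on `Y` can still be read off in the mutilated graph.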

**RCT vs observational study through the lens of do-operator:**
- RCT: $P(\text{outcome}|\text{do}(\text{treatment}=1))$ - physically realized mutilation. Randomization = cutting all incoming edges.
- Observational study: $P(\text{outcome}|\text{treatment}=1)$ - biased estimate when confounders are present.
- Cost of the cut: RCT in pharmacology costs USD 100-800M. Causal inference on observational data costs computation - if the graph is identifiable.

What happens to the DAG edges when computing P(Y|do(X=x))?

Identifiability and IRM in ML

Graph mutilation is conceptually elegant - but evaluating $P(Y|\text{do}(X))$ directly requires data from the mutilated world, that is, from an RCT. In practice one often has only observational data. The key question: **when is $P(Y|\text{do}(X))$ identifiable** - expressible through $P(Y, X, Z, \ldots)$ from observations?

The backdoor criterion gave the first answer: if $Z$ blocks all backdoor paths from $X$ to $Y$, then:

$$P(Y=y|\text{do}(X=x)) = \sum_z P(Y=y|X=x, Z=z) P(Z=z)$$

This is the adjustment formula - the 'controlling for confounders' familiar from medical statistics, now with a precise condition for applicability.
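The adjustment formula can be checked by hand on a small discrete example. The sketch below assumes a single binary confounder $Z$ satisfying the backdoor criterion; all probability tables are made-up numbers, and the helper names are hypothetical.

```python
# Assumed tables for binary Z, X, Y with graph Z -> X, Z -> Y, X -> Y.
p_z = {0: 0.7, 1: 0.3}                                    # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}                           # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.3,                # P(Y=1 | X=x, Z=z),
                 (0, 1): 0.6, (1, 1): 0.8}                # keyed by (x, z)

def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]

def p_y1_do(x):
    """Adjustment formula: sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

def p_y1_obs(x):
    """Ordinary conditioning: sum_z P(Y=1 | X=x, Z=z) * P(Z=z | X=x)."""
    joint = {z: p_x_given_z(x, z) * p_z[z] for z in p_z}  # P(X=x, Z=z)
    p_x = sum(joint.values())
    return sum(p_y1_given_xz[(x, z)] * joint[z] / p_x for z in p_z)

causal_effect = p_y1_do(1) - p_y1_do(0)
naive_gap = p_y1_obs(1) - p_y1_obs(0)

print(f"adjusted effect: {causal_effect:.3f}, naive gap: {naive_gap:.3f}")
```

The only difference between the two functions is the weight on each stratum: $P(Z=z)$ for the intervention versus $P(Z=z|X=x)$ for the observation. With these assumed tables that single change more than doubles the apparent effect.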

**ML and causality: IRM (Arjovsky et al. 2019).** Standard ERM finds $\arg\min_h \mathbb{E}[L(h(X), Y)]$ - minimizes risk over the entire dataset. If data contain spurious correlations (e.g., background color correlates with class), ERM learns them. IRM seeks features that give an invariant predictor **across environments** - an attempt to find causal features stable under different $\text{do}$-interventions.
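One way to make 'invariant across environments' concrete is the gradient penalty used in the practical IRMv1 objective: for each environment, take the derivative of the risk with respect to a dummy scale $w$ on the predictor, evaluated at $w = 1$, and square it. The sketch below evaluates that penalty for two fixed predictors on a toy model $X_1 \to Y \to X_2$ with squared loss. The data-generating process, the noise levels, and the two hand-picked predictors are assumptions for illustration, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(sigma, n=100_000):
    """Toy environment (assumed): X1 -> Y -> X2, with sigma varying per environment."""
    x1 = rng.normal(0.0, sigma, n)
    y = x1 + rng.normal(0.0, sigma, n)
    x2 = y + rng.normal(0.0, 1.0, n)
    return x1, x2, y

def irm_penalty(pred, y):
    """Squared gradient of the risk mean((w*pred - y)^2) w.r.t. w, taken at w = 1."""
    grad = np.mean(2.0 * (pred - y) * pred)
    return grad ** 2

envs = [make_env(0.5), make_env(2.0)]

# Predictor built on the causal parent X1 vs. one built on the effect X2.
penalty_causal = sum(irm_penalty(x1, y) for x1, x2, y in envs)
penalty_spurious = sum(irm_penalty(0.5 * x2, y) for x1, x2, y in envs)

print(f"causal: {penalty_causal:.4f}, spurious: {penalty_spurious:.4f}")
```

The predictor on $X_1$ is simultaneously optimal in both environments, so its penalty stays near zero. The best rescaling of $X_2$ changes with `sigma`, so no single predictor on $X_2$ can drive the penalty to zero everywhere.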

IRM is not a solution to the causality problem in ML, but the first systematic step. Counterfactual explanations in XAI ('what to change in $X$ to get a different $Y$') are another practical application of the do-operator in industry.

Misconception: a large enough neural network will learn causal relationships from data

Reality: without causal structure any ML algorithm is confined to Rung 1 of Pearl's ladder - statistical association

Pearl's result: there exist problems where $P(Y|\text{do}(X)) \neq P(Y|X)$, and the same observational distribution $P(X, Y, Z)$ is compatible with causal models that yield different do-distributions. No algorithm operating only on $P(X, Y, Z)$ can therefore recover the do-distribution without additional assumptions about structure (DAG). This is not a computational limitation - it is an information barrier.
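The information barrier can be illustrated with two tiny structural models that produce exactly the same observational distribution over $(X, Y)$ yet disagree about $\text{do}(X=1)$. Both models and the `distribution` helper are hypothetical constructions for this sketch. The noise variable $U$ is a fair coin, and in the second model it acts as a hidden common cause.

```python
def model_a(u, do_x=None):
    """X -> Y: X copies the noise U, Y copies X."""
    x = u if do_x is None else do_x
    y = x
    return x, y

def model_b(u, do_x=None):
    """X <- U -> Y: both X and Y copy the hidden U; there is no X -> Y edge."""
    x = u if do_x is None else do_x
    y = u
    return x, y

def distribution(model, do_x=None):
    """Exact distribution over (x, y), enumerating U ~ Bernoulli(0.5)."""
    dist = {}
    for u in (0, 1):
        xy = model(u, do_x)
        dist[xy] = dist.get(xy, 0.0) + 0.5
    return dist

obs_a, obs_b = distribution(model_a), distribution(model_b)
do_a, do_b = distribution(model_a, do_x=1), distribution(model_b, do_x=1)

print("same observations:", obs_a == obs_b)
print("P(Y=1|do(X=1)):", do_a.get((1, 1), 0.0), "vs", do_b.get((1, 1), 0.0))
```

Any procedure that sees only the observational table receives identical input from both models, so it must give the same answer for both, and that answer is wrong for at least one of them.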

A model trained on P(Y|X) achieves high accuracy. Does this guarantee correct predictions of P(Y|do(X)) under distribution shift?

Key ideas

**$P(Y|X) \neq P(Y|\text{do}(X))$** in the presence of confounders - a fundamental distinction that does not shrink with more data
**Graph mutilation:** $\text{do}(X=x)$ = remove all incoming edges to X, fix X=x. Causal effect is computed in the mutilated world
**Pearl's ladder:** Rung 1 (correlation, all ML) - Rung 2 (intervention, do-operator) - Rung 3 (counterfactuals)
**Adjustment formula:** when the backdoor criterion holds, $P(Y|\text{do}(X)) = \sum_z P(Y|X,Z=z)P(Z=z)$ - causal effect from observational data
**ML and causality:** IRM seeks invariant features; counterfactual explanations in XAI are practical industry applications of the do-operator

What comes next

The do-operator opens the full system of Pearl's do-calculus:

Do-calculus — Three rules for transforming do-expressions - complete axiomatization of identification
Identifiability — When P(Y|do(X)) is computable from observations without explicit RCT
Counterfactuals — Rung 3: P(Y_x | X=x', Y=y') - what would have happened under a different decision
DAGs and d-separation — Foundation: reading information flows in a causal graph

Questions for reflection

Google Ads targets users who resemble past buyers. Is this estimating P(purchase|ad shown) or P(purchase|do(ad shown))? Which one does the advertiser actually want?
In large language models a token is selected according to P(token|context). This is Rung 1. Can an LLM reason correctly about do-operators without having causal structure?
The backdoor adjustment requires a measurable confounder Z. What if the confounder is unobserved? How does the frontdoor criterion help in that case?

Related lessons

cc-04-frontdoor — Frontdoor criterion: first example of computing do-expressions from observational data
cc-06-do-calculus — Pearl's three rules systematize transformations of do-expressions
cc-07-identifiability — Identifiability: when P(Y|do(X)) is computable from observational data
cc-09-counterfactuals — Counterfactuals - the third rung of the causal ladder, above interventions
lt-01-pac-intro — IRM (Invariant Risk Minimization) - ML's attempt to learn causal features rather than correlations
stat-01-sampling