Causal Calculus
DAG: a graph as the language of causality
2000: Judea Pearl publishes 'Causality'. The question 'what would have happened if' finally gets a mathematical language: the do-calculus. In 2011 he receives the Turing Award. A DAG turns 'correlation is not causation' from a shrug into a tool.
- DoWhy / EconML (Microsoft): causal analysis in Bing, Office, Azure where A/B tests are too expensive or impossible
- RLHF reward modeling: a DAG can reveal that the reward model has learned style rather than quality, i.e. reward hacking
- Off-policy evaluation at Spotify, Netflix, Booking: estimating the value of a new policy without deploying it, using backdoor adjustment
- Healthcare ML: propensity score from DAG for treatment effect estimation in observational data
Prerequisites
Nodes, edges, and the three words of DAG
**For 30 years statistics chanted the mantra: "correlation is not causation".** For 30 years it was an excuse, not a tool. One could say "there's a confounder" and shrug. Then in 1988 Judea Pearl publishes "Probabilistic Reasoning in Intelligent Systems". The rules change: causality is no longer philosophy but a graph. Nodes, edges, direction. The phrase "X influences Y" stops being rhetoric and becomes a formula. In 2011 Pearl receives the Turing Award.
DAG stands for Directed Acyclic Graph. Three words carry weight. **Directed**: arrows go from cause to effect, not symmetric links. **Acyclic**: a cause cannot be its own effect; X → Y → X is forbidden. **Graph**: a discrete-math object with all its algorithms. Each edge X → Y is a claim: "an intervention on X will change Y". Each missing edge is a claim: "there is no direct influence".
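The three words can be made concrete in a few lines. The sketch below is only an illustration, not a reference implementation: it stores a graph as a list of (cause, effect) pairs, checks acyclicity with the standard-library `graphlib` module (Python 3.9+), and lists the node pairs with no edge between them, since each of those is also a claim. The variable names (`season`, `ad_spend`, `sales`, `weather`) and the helper names `is_dag` and `missing_edges` are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError
from itertools import combinations


def is_dag(edges):
    """Return True if the directed (cause, effect) pairs contain no cycle."""
    # TopologicalSorter expects a mapping: node -> set of its predecessors.
    parents = {}
    for cause, effect in edges:
        parents.setdefault(cause, set())
        parents.setdefault(effect, set()).add(cause)
    try:
        # A topological order exists only if the graph is acyclic.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def missing_edges(edges):
    """List node pairs with no edge in either direction."""
    nodes = sorted({node for edge in edges for node in edge})
    linked = {frozenset(edge) for edge in edges}
    return [pair for pair in combinations(nodes, 2)
            if frozenset(pair) not in linked]


# Hypothetical marketing graph, used for illustration only.
edges = [
    ("season", "ad_spend"),
    ("season", "sales"),
    ("ad_spend", "sales"),
    ("weather", "sales"),
]

print(is_dag(edges))                          # True
print(is_dag(edges + [("sales", "season")]))  # False: season -> sales -> season
print(missing_edges(edges))
# [('ad_spend', 'weather'), ('season', 'weather')]
```

Each pair returned by `missing_edges` is an assumption that someone on the team should be willing to defend, which is the point of writing the graph down.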
A DAG makes hidden assumptions **visible**. When a data engineer decides whether to include a variable in a regression, they build a DAG in their head. Making the DAG explicit is not extra work - it is translating implicit assumptions into testable ones. Teams that draw DAGs before experiments catch confounding bugs before production. The rest catch them in postmortems.
What does a missing edge between X and Y in a DAG mean?
d-separation: reading conditional independencies from the graph
Every path in a DAG is assembled from three elementary 3-node structures. **Chain**: X → Z → Y. Z is a mediator; X's influence on Y flows through Z. Conditioning on Z breaks the chain: X ⊥ Y | Z. **Fork**: X ← Z → Y. Z is a confounder, a common cause. X and Y correlate with no causal link between them. Conditioning on Z removes the dependence: X ⊥ Y | Z. **Collider**: X → Z ← Y. Z is a common effect. X and Y are marginally independent: with nothing conditioned on, the path is already blocked. Conditioning on Z (or on any descendant of Z) **creates** a spurious dependence.
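The three rules can be checked mechanically. The sketch below uses one standard way to test d-separation, the moralized ancestral graph: keep only the relevant nodes and their ancestors, connect parents that share a child, drop edge directions, delete the conditioning set, and see whether X can still reach Y. The function names `ancestors_of` and `d_separated` are assumptions made for this example, and the code assumes X and Y are not themselves in the conditioning set.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `edges`."""
    given = set(given)
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)

    # 1. Restrict to x, y, the conditioning set, and their ancestors.
    keep = ancestors_of(parents, {x, y} | given)

    # 2. Moralize: link child-parent pairs, "marry" co-parents, drop directions.
    neighbours = {node: set() for node in keep}
    for child in keep:
        node_parents = parents.get(child, set())
        for parent in node_parents:
            neighbours[child].add(parent)
            neighbours[parent].add(child)
        for a, b in combinations(node_parents, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # 3. Block the conditioning nodes and search for a path from x to y.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False  # still connected, so not d-separated
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True


chain = [("X", "Z"), ("Z", "Y")]
fork = [("Z", "X"), ("Z", "Y")]
collider = [("X", "Z"), ("Y", "Z")]

for name, graph in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name,
          "| X indep Y:", d_separated(graph, "X", "Y"),
          "| X indep Y given Z:", d_separated(graph, "X", "Y", {"Z"}))
# chain    | X indep Y: False | X indep Y given Z: True
# fork     | X indep Y: False | X indep Y given Z: True
# collider | X indep Y: True  | X indep Y given Z: False
```

The collider row is the mirror image of the other two, which is exactly the asymmetry the three definitions describe.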
A collider behaves **opposite** to a fork and a chain. In a fork or a chain, conditioning removes dependence. In a collider, conditioning creates it from nothing. Filtering data by Z (e.g., 'successful applicants only') is a form of conditioning on Z, so it makes X and Y correlated even when nothing in nature ties them. This is **collider bias**; when it enters through the way the sample was selected, it is called **selection bias**. It quietly invalidates empirical studies and is among the hardest mistakes to spot, because the filter usually looks like innocent data cleaning.
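A small simulation makes the effect visible. The sketch below is a toy model under loudly stated assumptions: two independent standard-normal traits, hypothetically called `skill` and `luck`, and an admission rule `skill + luck > 1.0` standing in for 'successful applicants only'. None of these numbers come from real data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Two causes generated independently: no edge between them.
skill = [rng.gauss(0, 1) for _ in range(n)]
luck = [rng.gauss(0, 1) for _ in range(n)]

# Collider: admission depends on both causes (skill -> admitted <- luck).
admitted = [(s, l) for s, l in zip(skill, luck) if s + l > 1.0]

r_all = pearson(skill, luck)
r_admitted = pearson([s for s, _ in admitted], [l for _, l in admitted])

print(f"everyone:      r = {r_all:+.2f}")       # close to zero
print(f"admitted only: r = {r_admitted:+.2f}")  # clearly negative
```

Among the admitted, low skill must have been compensated by high luck and vice versa, so the filter alone manufactures a negative correlation.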
A study of celebrities finds that attractive people seem less talented. Which DAG structure creates this artifact?
The do-operator: correlation does not imply causation
The core distinction between causal and statistical language: causal language speaks of **interventions** (what happens if X is changed), statistical language of **observations** (what is seen when X = x). P(Y | X = x) is a conditional distribution from observation. P(Y | do(X = x)) is the distribution after intervention. They differ in the presence of confounders. A barometer predicts rain: P(rain | low_barometer) is high. But force the needle down by hand, and P(rain | do(low_barometer)) equals the baseline probability of rain. The barometer does not control the weather.
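The barometer story can be simulated directly. In the sketch below, atmospheric pressure is the hidden common cause of both the reading and the rain; all probabilities (0.3, 0.95, 0.8, 0.1) are made-up assumptions chosen only to make the gap visible, and `sample` is a hypothetical helper.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable


def sample(do_barometer=None):
    """Draw one day. If do_barometer is set, the reading is forced,
    which cuts the arrow pressure -> barometer."""
    low_pressure = rng.random() < 0.3
    if do_barometer is None:
        # Observation: the barometer tracks pressure, with 5% error.
        barometer_low = low_pressure if rng.random() < 0.95 else not low_pressure
    else:
        # Intervention: the reading no longer depends on pressure.
        barometer_low = do_barometer
    # Rain depends on pressure only, never on the barometer.
    rain = rng.random() < (0.8 if low_pressure else 0.1)
    return barometer_low, rain


n = 50_000
observed = [sample() for _ in range(n)]
forced = [sample(do_barometer=True) for _ in range(n)]

rain_when_low = [rain for low, rain in observed if low]
p_obs = sum(rain_when_low) / len(rain_when_low)  # P(rain | low_barometer)
p_do = sum(rain for _, rain in forced) / n       # P(rain | do(low_barometer))
p_base = sum(rain for _, rain in observed) / n   # P(rain)

print(f"P(rain | low_barometer)     = {p_obs:.2f}")
print(f"P(rain | do(low_barometer)) = {p_do:.2f}")
print(f"P(rain)                     = {p_base:.2f}")
```

Under these assumed numbers the observational probability is far above the baseline, while the interventional one matches the baseline up to sampling noise.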
Graphically, do(X = x) means cutting every arrow into X and fixing X = x. This removes the confounders' influence on X. A **backdoor path** is any path from X to Y that starts with an arrow into X. **Backdoor criterion**: a set Z satisfies it if Z blocks every backdoor path and contains no descendants of X. When the criterion holds: P(Y | do(X = x)) = Σ_z P(Y | X = x, Z = z) P(Z = z) - the adjustment formula.
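The adjustment formula is short enough to compute by hand on simulated data. The sketch below assumes the simplest possible graph, Z → X, Z → Y, X → Y, with binary variables and a true effect of X on Y fixed at +0.2 by construction. The probabilities and the helper names `draw` and `mean_y` are illustrative assumptions.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable


def draw():
    z = rng.random() < 0.5                      # confounder
    x = rng.random() < (0.8 if z else 0.2)      # z makes treatment more likely
    y = rng.random() < 0.1 + 0.2 * x + 0.5 * z  # true effect of x is +0.2
    return z, x, y


data = [draw() for _ in range(100_000)]


def mean_y(x, z=None):
    """Empirical P(Y=1 | X=x), or P(Y=1 | X=x, Z=z) when z is given."""
    selected = [yy for zz, xx, yy in data
                if xx == x and (z is None or zz == z)]
    return sum(selected) / len(selected)


# P(Z=z), estimated from the whole sample.
p_z = {z: sum(1 for zz, _, _ in data if zz == z) / len(data)
       for z in (False, True)}

# Naive contrast: P(Y | X=1) - P(Y | X=0), confounded by Z.
naive = mean_y(True) - mean_y(False)

# Backdoor adjustment: sum over z of P(Y | X, Z=z) * P(Z=z), for X=1 minus X=0.
adjusted = sum((mean_y(True, z) - mean_y(False, z)) * p_z[z]
               for z in (False, True))

print(f"naive difference:    {naive:.2f}")     # inflated by the confounder
print(f"adjusted difference: {adjusted:.2f}")  # near the true +0.2
```

The naive contrast mixes the effect of X with the effect of Z, because treated units are mostly Z = 1. Averaging the within-stratum contrasts over P(Z) instead of P(Z | X) removes that mixture.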
Ads are shown more often to users with high interest in the product. How does this affect the estimated ad effect?
Summary
- DAG = causal grammar: nodes are variables, directed edges are direct causal links, missing edges are independence claims
- Three bricks: chain (X→Z→Y, mediator), fork (X←Z→Y, confounder), collider (X→Z←Y, common effect)
- Conditioning on a mediator or confounder removes dependence; conditioning on a collider creates a spurious one
- P(Y|X=x) ≠ P(Y|do(X=x)) in general: observation vs intervention. The difference comes from backdoor paths through confounders
- A DAG is posited, not derived: data distinguishes only the Markov equivalence class; edge direction comes from time, experiments, or domain expertise
- Collider bias quietly invalidates studies and is easy to introduce through data filtering. Cure: draw the DAG before analysis, mark colliders
Where to go next
DAG is the vocabulary. Next come the grammar and operations that turn a graph into a formula.
- d-separation — Algorithm for reading conditional independencies directly from the graph. Without it a DAG is a picture; with it a computable model
- Backdoor criterion — Which nodes to control for correct causal effect estimation. A direct consequence of fork and chain structure
- do-operator — Formalizing intervention. P(Y|X) to P(Y|do(X)) is the heart of causal inference
- Causal discovery — PC, FCI, NOTEARS: can a DAG be recovered from data, and where are the limits
Questions for reflection
- Which DAGs live implicitly in the architecture of the current project? What changes when they are drawn explicitly?
- Which A/B tests in the team carry risk of collider bias through data filtering?
- Which recent decisions to 'add a feature to the model' would be reconsidered if a DAG had been drawn before analysis?