Causal Calculus

DAG: a graph as the language of causality

2000: Judea Pearl publishes 'Causality'. The question 'what would have happened if' finally gets a mathematical language: the do-calculus. In 2011 Pearl receives the Turing Award. A DAG turns 'correlation is not causation' from a shrug into a tool.

DoWhy / EconML (Microsoft): causal analysis in Bing, Office, and Azure, where A/B tests are too expensive or impossible
RLHF reward modeling: a DAG reveals that the reward model is learning style rather than quality, which is reward hacking
Off-policy evaluation at Spotify, Netflix, Booking: using the backdoor criterion to estimate the value of a new policy without deploying it
Healthcare ML: propensity scores built on a DAG for treatment effect estimation from observational data

Prerequisites

Causal Inference

Nodes, edges, and the three words of DAG

**For decades statistics chanted the mantra: "correlation is not causation".** For decades it was an excuse, not a tool. One could say "there's a confounder" and shrug. Then in 1988 Judea Pearl publishes "Probabilistic Reasoning in Intelligent Systems", which puts probabilistic reasoning on graphs. The rules begin to change: causality is no longer philosophy but a graph. Nodes, edges, direction. The phrase "X influences Y" stops being rhetoric and becomes a formula. In 2011 Pearl receives the Turing Award.

DAG stands for Directed Acyclic Graph. All three words carry weight. **Directed**: arrows go from cause to effect; links are not symmetric. **Acyclic**: a cause cannot be its own effect, so X → Y → X is forbidden. **Graph**: a discrete-math object with all its algorithms. Each edge X → Y is a claim: "an intervention on X can change Y directly". Each missing edge is a claim: "there is no direct influence".
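
The three words can be checked mechanically. Below is a minimal sketch, assuming a DAG is stored as a list of (cause, effect) pairs; the variable names (`interest`, `ad_shown`, `purchase`) and the helper names `is_acyclic` and `parents` are hypothetical, chosen only for illustration. Acyclicity is tested with Kahn's topological-sort algorithm.

```python
from collections import deque

def is_acyclic(edges):
    """Return True if the directed graph given as (cause, effect) pairs has no cycle."""
    nodes = {n for edge in edges for n in edge}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for cause, effect in edges:
        children[cause].append(effect)
        indegree[effect] += 1
    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # If a cycle exists, its nodes never reach indegree 0 and are never removed.
    return removed == len(nodes)

def parents(edges, node):
    """Direct causes of `node`: every variable with an arrow pointing into it."""
    return sorted(cause for cause, effect in edges if effect == node)

# Hypothetical example graph (an assumption, not taken from real data).
ad_dag = [("interest", "ad_shown"), ("interest", "purchase"), ("ad_shown", "purchase")]
cyclic = [("X", "Y"), ("Y", "X")]

print(is_acyclic(ad_dag))           # a valid DAG
print(is_acyclic(cyclic))           # X -> Y -> X is forbidden
print(parents(ad_dag, "purchase"))  # direct causes of purchase
```

Writing the graph down as data is what makes the missing edges explicit: any pair absent from the list is a stated claim of no direct influence.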

A DAG makes hidden assumptions **visible**. When a data engineer decides whether to include a variable in a regression, they build a DAG in their head. Making the DAG explicit is not extra work - it is translating implicit assumptions into testable ones. Teams that draw DAGs before experiments catch confounding bugs before production. The rest catch them in postmortems.

What does a missing edge between X and Y in a DAG mean?

d-separation: reading conditional independencies from the graph

Every path in a DAG is built from three elementary three-node structures. **Chain**: X → Z → Y. Z is a mediator; X's influence on Y flows through Z. Conditioning on Z breaks the chain: X ⊥ Y | Z. **Fork**: X ← Z → Y. Z is a confounder, a common cause. X and Y correlate with no causal link between them. Conditioning on Z removes the dependence: X ⊥ Y | Z. **Collider**: X → Z ← Y. Z is a common effect. X and Y are marginally independent! Conditioning on Z **creates** a spurious dependence.
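
The three structures can be checked on simulated data. The sketch below is illustrative only: it assumes linear relationships with unit Gaussian noise (coefficients chosen arbitrarily) and uses partial correlation as a stand-in for "conditioning on Z", which is adequate in the linear-Gaussian case but not a general conditional-independence test. The function names are hypothetical.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def simulate(structure, n=20000, seed=0):
    """Draw n samples of (X, Y, Z) from one of the three elementary structures."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        if structure == "chain":        # X -> Z -> Y
            x = rng.gauss(0, 1)
            z = x + rng.gauss(0, 1)
            y = z + rng.gauss(0, 1)
        elif structure == "fork":       # X <- Z -> Y
            z = rng.gauss(0, 1)
            x = z + rng.gauss(0, 1)
            y = z + rng.gauss(0, 1)
        elif structure == "collider":   # X -> Z <- Y
            x = rng.gauss(0, 1)
            y = rng.gauss(0, 1)
            z = x + y + rng.gauss(0, 1)
        else:
            raise ValueError(structure)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

results = {}
for name in ("chain", "fork", "collider"):
    x, y, z = simulate(name)
    results[name] = (corr(x, y), partial_corr(x, y, z))
    print(f"{name:9s} corr(X,Y)={results[name][0]:+.2f}  corr(X,Y|Z)={results[name][1]:+.2f}")
```

Under these assumptions the chain and the fork should show a clearly nonzero marginal correlation that drops to roughly zero once Z is controlled for, while the collider should show the reverse: roughly zero marginally and clearly negative after controlling for Z.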

A collider behaves **opposite** to a fork and a chain. In a fork, conditioning removes dependence. In a collider, conditioning creates it from nothing. Filtering data by Z (e.g., 'successful applicants only') makes X and Y correlated even when nothing in nature ties them. This is **collider bias**, a form of **selection bias**, and it is one of the easiest ways to ruin an empirical study without noticing.
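
The 'successful applicants only' filter can be reproduced in a few lines. This is a sketch under invented assumptions: two independent standard-normal traits (called `skill` and `charm` here, hypothetical names) and an admission rule that accepts anyone whose sum exceeds an arbitrary threshold of 1.5. Nothing in it comes from real admissions data.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(1)

# Two traits generated independently: no causal or statistical link between them.
applicants = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50000)]

# Admission is the collider: skill -> admitted <- charm.
# Keeping only admitted applicants is conditioning on that collider.
admitted = [(skill, charm) for skill, charm in applicants if skill + charm > 1.5]

r_all = corr([s for s, _ in applicants], [c for _, c in applicants])
r_admitted = corr([s for s, _ in admitted], [c for _, c in admitted])

print(f"all applicants: corr = {r_all:+.2f}")
print(f"admitted only:  corr = {r_admitted:+.2f}")
```

Among all applicants the correlation should sit near zero; among the admitted it should turn clearly negative, because a low value on one trait can only pass the filter when the other trait is high.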

A study of celebrities finds that attractive people seem less talented. Which DAG structure creates this artifact?

The do-operator: correlation does not imply causation

The core distinction between causal and statistical language: causal language speaks of **interventions** (what happens if X is changed), statistical language of **observations** (what is observed when X = x). P(Y | X = x) is a conditional distribution from observation. P(Y | do(X = x)) is the distribution after intervention. The two can differ whenever something other than X's effect on Y links them, most often a confounder. A barometer predicts rain: P(rain | low_barometer) is high. But force the needle down by hand and P(rain | do(low_barometer)) equals the baseline probability of rain. The barometer does not control the weather.
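
The barometer story can be simulated directly. The sketch below assumes a toy world in which atmospheric pressure is the common cause of both the barometer reading and rain; all probabilities (0.3, 0.95, 0.8, 0.1) and the `force_barometer` parameter are invented for illustration.

```python
import random

def simulate(n, seed, force_barometer=None):
    """Return (barometer_low, rain) pairs.

    force_barometer=None  -> observational world: the barometer follows pressure.
    force_barometer=True  -> do(barometer = low): the arrow pressure -> barometer is cut.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        low_pressure = rng.random() < 0.3
        if force_barometer is None:
            # The barometer tracks pressure, with a 5% chance of a wrong reading.
            barometer_low = low_pressure if rng.random() < 0.95 else not low_pressure
        else:
            # Intervention: the reading is set by hand and ignores pressure.
            barometer_low = force_barometer
        # Rain depends on pressure only, never on the barometer.
        rain = rng.random() < (0.8 if low_pressure else 0.1)
        rows.append((barometer_low, rain))
    return rows

def rain_rate(rows):
    return sum(rain for _, rain in rows) / len(rows)

observed = simulate(50000, seed=0)
intervened = simulate(50000, seed=1, force_barometer=True)

p_base = rain_rate(observed)                                # P(rain)
p_obs = rain_rate([r for r in observed if r[0]])            # P(rain | barometer low)
p_do = rain_rate(intervened)                                # P(rain | do(barometer low))

print(f"P(rain)                     = {p_base:.2f}")
print(f"P(rain | barometer low)     = {p_obs:.2f}")
print(f"P(rain | do(barometer low)) = {p_do:.2f}")
```

In this toy world the observational probability should be far above the baseline, while the interventional one should match the baseline up to sampling noise. The only difference between the two runs is whether the arrow into the barometer is kept or cut.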

Graphically, do(X = x) means cutting every arrow into X and fixing X = x. This removes the confounders' influence on X. **Backdoor criterion**: a set Z satisfies it relative to (X, Y) if Z contains no descendants of X and blocks every backdoor path, that is, every path between X and Y that starts with an arrow pointing into X. When the backdoor criterion is satisfied: P(Y | do(X = x)) = Σ_z P(Y | X = x, Z = z) P(Z = z). This is the adjustment formula.
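
The adjustment formula can be applied to discrete data by stratifying on Z. The following is a minimal sketch, assuming a single binary confounder Z that satisfies the backdoor criterion and a data-generating process invented for the example, in which the true effect of X on P(Y = 1) is 0.2. The function names `p_y_given_x` and `p_y_do_x` are hypothetical.

```python
import random

def simulate(n=100000, seed=2):
    """Toy data with the fork X <- Z -> Y plus a direct effect X -> Y of size 0.2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = int(rng.random() < 0.5)
        x = int(rng.random() < (0.8 if z else 0.2))      # Z pushes X up
        y = int(rng.random() < 0.1 + 0.2 * x + 0.5 * z)  # Z pushes Y up too
        rows.append((x, y, z))
    return rows

def p_y_given_x(rows, x):
    """Observational P(Y=1 | X=x): plain conditioning, confounded by Z."""
    ys = [y for xi, y, _ in rows if xi == x]
    return sum(ys) / len(ys)

def p_y_do_x(rows, x):
    """Backdoor adjustment: sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    n = len(rows)
    total = 0.0
    for z in sorted({zi for _, _, zi in rows}):
        stratum = [y for xi, y, zi in rows if xi == x and zi == z]
        p_z = sum(1 for _, _, zi in rows if zi == z) / n
        total += (sum(stratum) / len(stratum)) * p_z
    return total

rows = simulate()
naive = p_y_given_x(rows, 1) - p_y_given_x(rows, 0)
adjusted = p_y_do_x(rows, 1) - p_y_do_x(rows, 0)

print(f"naive difference:    {naive:.2f}")
print(f"adjusted difference: {adjusted:.2f}   (effect built into the simulation: 0.20)")
```

The naive difference mixes the causal effect with the open backdoor path through Z, so it should come out well above 0.2; the adjusted difference should land close to 0.2. The sketch assumes every (x, z) stratum is non-empty, which is the positivity condition the formula needs in practice.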

Ads are shown more often to users with high interest in the product. How does this affect the estimated ad effect?

Summary

DAG = causal grammar: nodes are variables, directed edges are direct causal links, missing edges are claims of no direct influence
Three bricks: chain (X→Z→Y, mediator), fork (X←Z→Y, confounder), collider (X→Z←Y, common effect)
Conditioning on a mediator or confounder removes dependence; conditioning on a collider creates a spurious one
P(Y|X=x) ≠ P(Y|do(X=x)) in general: observation vs intervention. The difference comes from backdoor paths through confounders
A DAG is posited, not derived: data distinguishes only the Markov equivalence class; edge direction comes from time, experiments, or domain expertise
Collider bias is one of the easiest ways to ruin a study without noticing. Cure: draw the DAG before analysis, mark colliders

Where to go next

DAG is the vocabulary. Next come the grammar and operations that turn a graph into a formula.

d-separation — Algorithm for reading conditional independencies directly from the graph. Without it a DAG is a picture; with it a computable model
Backdoor criterion — Which nodes to control for correct causal effect estimation. A direct consequence of fork and chain structure
do-operator — Formalizing intervention. P(Y|X) to P(Y|do(X)) is the heart of causal inference
Causal discovery — PC, FCI, NOTEARS: can a DAG be recovered from data, and where are the limits

Questions for reflection

Which DAGs live implicitly in the architecture of the current project? What changes when they are drawn explicitly?
Which A/B tests in the team carry risk of collider bias through data filtering?
Which recent decisions to 'add a feature to the model' would be reconsidered if a DAG had been drawn before analysis?

Related lessons

stat-27-graphical-models
