Potential Outcomes Framework (Rubin Causal Model)

Lesson objectives

Distinguish observed outcomes from potential outcomes Y(1), Y(0)
Write down ATE, ATT, ATU, CATE and explain how they differ
Formulate ignorability and overlap and explain why ignorability cannot be tested from data
Connect causal inference to off-policy evaluation in RL
Recognize selection bias in naive comparisons and read a gap between ATT and ATU as a sign of non-random treatment assignment

Prerequisites

Causality basics and confounding
Idea of randomization in RCTs
Conditional expectations and basic regression

Normandy, 1944. The same soldier receives one of two orders. The counterfactual - what would have happened under the other order - is forever unknown. This is the fundamental problem of causality: for every unit only ONE of two potential outcomes is observed. Donald Rubin in 1974 turned this philosophical puzzle into a working mathematical apparatus, and Y(1), Y(0) became the foundation of modern causal inference - from FDA to Meta.

**FDA and pharma**: the Rubin model is the regulatory standard for estimating drug ATE before approval
**Tech A/B tests**: Meta, Google, Airbnb all use the potential outcomes framework for feature evaluation
**Healthcare**: personalized medicine is built on CATE estimation - who benefits more from a drug
**Reinforcement Learning**: off-policy evaluation is direct counterfactual reasoning
**Policy evaluation**: government programs (minimum wage, grants) are evaluated through ATE

Birth of modern causal inference

In 1974 Donald Rubin published 'Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies', formalizing the potential outcomes framework. The roots reach back to Jerzy Neyman's 1923 paper (in Polish, on agricultural experiments), but Rubin's contribution was to extend the framework to observational studies, a conceptual revolution. Separately, in the 1990s Judea Pearl developed a graph-based approach built on DAGs and do-calculus. For a long time the Rubin and Pearl schools were rivals, although the two formalisms are mathematically equivalent. In the 2010s both approaches fed into modern causal ML: double machine learning (Chernozhukov et al., 2018), causal forests (Wager and Athey, 2018), and uplift modeling in industry.

Y(1) and Y(0): two worlds of one unit

Normandy, 1944. A soldier receives one of two possible orders: advance, or hold position. Whichever order arrives, the second scenario - the counterfactual - will never be observed. This is the fundamental problem of causality: the same unit cannot simultaneously receive and not receive a treatment. Donald Rubin formalized this idea in 1974 under the name 'potential outcomes framework', and the simple notation Y(1) and Y(0) became the backbone of modern causal inference - from clinical trials to A/B tests at tech companies.

For each unit i (patient, user, region) two numbers are defined: Y_i(1) - the outcome under treatment, Y_i(0) - the outcome without treatment. The individual treatment effect is the difference tau_i = Y_i(1) - Y_i(0). If Ann took the drug and her blood pressure dropped 10 points, that is Y_Ann(1). What would have happened to the same Ann at the same moment without the drug - Y_Ann(0) - is forever unobservable.
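The notation can be made concrete with a toy table. The sketch below uses made-up numbers for five hypothetical patients; in a simulation both potential outcomes are visible, whereas real data would contain only `t` and `y_obs`.

```python
import numpy as np

# Hypothetical toy data: blood-pressure change for five patients.
y1 = np.array([-10.0, -4.0, -12.0, 0.0, -6.0])   # Y_i(1): outcome with the drug
y0 = np.array([0.0, -1.0, -2.0, 0.0, -3.0])      # Y_i(0): outcome without the drug
t = np.array([1, 0, 1, 0, 1])                    # treatment actually received

tau = y1 - y0                       # individual effects, known only in a simulation
y_obs = np.where(t == 1, y1, y0)    # the single outcome that is ever observed

# The counterfactual is the potential outcome that was NOT realized.
y_missing = np.where(t == 1, y0, y1)

print("tau_i:   ", tau)
print("observed:", y_obs)
print("missing: ", y_missing)
```

The line computing `y_obs` is the whole link between potential and observed outcomes: Y = T * Y(1) + (1 - T) * Y(0).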

Holland (1986) named this the 'fundamental problem of causal inference': it is impossible to simultaneously observe Y(1) and Y(0) for the same unit. Every causal method is a way around this problem through population-level averaging.

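The notation Y_i(1), Y_i(0) silently relies on SUTVA (the stable unit treatment value assumption): a unit's potential outcomes depend only on its own treatment, not on the treatments of other units, and there is a single version of each treatment. On social networks, vaccinating one user changes the infection probability of graph neighbors - that is interference. In recommender systems, showing one item to one user influences the algorithm for everyone else. When SUTVA fails, specialized methods are needed: cluster randomization, network experiments, marketplace experiments.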

ML connection: offline policy evaluation

Production logs as potential outcomes

In offline reinforcement learning, every logged trajectory is one potential outcome: what happened under the policy that collected the logs. What would have happened under a new policy is the counterfactual, never observed. Importance sampling estimates the expected outcome of the new policy by reweighting logs. This is the same logic as the Rubin model: recovering an unobservable potential outcome through statistical assumptions.

What is Holland's 'fundamental problem of causal inference'?

Every unit actually receives either T=1 or T=0, and the observed outcome corresponds to only one potential outcome. The other (counterfactual) outcome is forever missing. Every causal inference method is a strategy to work around this missing data problem.

ATE, ATT, CATE: a family of average effects

If individual effects are unobservable, what can actually be estimated? Answer: group averages. The three main quantities - ATE, ATT, ATU - differ in which subgroup is averaged over. The differences between them are not statistical noise: they signal that the treated and untreated groups differ systematically, which is also what produces selection bias, the principal enemy of observational studies.

ATE is the average effect if the drug were given to everyone. ATT is the average effect among those who actually received the drug. ATU is the average effect among those who did not (hypothetically). In an RCT all three coincide in expectation because randomization makes the treatment group's distribution statistically identical to the control's. In observational data they can diverge - and that divergence is a warning sign that treatment was assigned selectively.

Hypertension drug

ATT and ATU diverge - treatment choice is not random

At a clinic the drug is prescribed to the sickest patients - those whose untreated pressure is 180+. They also have larger physiological room to drop: ATT = -20 units. Mild patients do not get the drug, and their potential effect is smaller: ATU = -8. The gap ATT - ATU = -12 signals that treatment assignment was non-random. If observational data is analyzed by simple difference of means, the resulting estimate mixes the causal effect with selection bias.
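The clinic story can be reproduced in a small simulation. This is a sketch under stated assumptions: the baseline pressures (180 and 140), the noise level, the 50/50 split, and the rule "treat exactly the severe patients" are illustrative choices, and the effects -20 and -8 are taken from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: half the patients are severe (illustrative numbers).
severe = rng.random(n) < 0.5
y0 = np.where(severe, 180.0, 140.0) + rng.normal(0.0, 10.0, n)  # untreated pressure
tau = np.where(severe, -20.0, -8.0)                              # individual effects
y1 = y0 + tau                                                    # treated pressure

t = severe.astype(int)               # the clinic treats exactly the severe patients
y_obs = np.where(t == 1, y1, y0)     # only one potential outcome is recorded

ate = tau.mean()
att = tau[t == 1].mean()
atu = tau[t == 0].mean()
naive = y_obs[t == 1].mean() - y_obs[t == 0].mean()
selection_bias = y0[t == 1].mean() - y0[t == 0].mean()

print(f"ATE={ate:.1f}  ATT={att:.1f}  ATU={atu:.1f}")
print(f"naive difference={naive:.1f}  selection bias={selection_bias:.1f}")
```

In this setup the naive difference of means comes out positive, as if the drug raised blood pressure, because treated patients started from a much higher baseline. The true effect is negative in every subgroup.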

CATE (Conditional ATE) is a function of covariates: tau(x) = E[Y(1) - Y(0) | X = x]. It is no longer a single number but a whole function describing how the effect varies with unit features. CATE is exactly what is needed for personalized treatment recommendations.

Quantity	What it estimates	When it is useful
ATE	Effect on whole population	Policy decisions, mandatory programs
ATT	Effect on the treated	Evaluating an already-deployed program
ATU	Hypothetical effect on untreated	Whether to expand a program to others
CATE tau(x)	Effect as a function of features	Personalization: who should receive treatment

CATE estimation is the central task of causal ML. Meta uses X-learner and DR-learner to predict for whom an ad actually drives conversion (rather than merely correlates). Causal forests (Wager and Athey, 2018) give non-parametric tau(x) estimates. Uplift modeling in marketing is exactly CATE estimation: finding the 'persuadables' for whom intervention is maximally useful.
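To show what estimating tau(x) looks like in code, here is a minimal sketch of a T-learner, a simpler relative of the X-learner and DR-learner named above (it is not either of those methods). It assumes a randomized treatment, a single feature, and linear outcome models; the data-generating process and the helper `tau_hat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical randomized experiment with one feature x in [0, 1].
x = rng.random(n)
t = rng.integers(0, 2, n)                      # coin-flip assignment
y0 = 1.0 + x + rng.normal(0.0, 1.0, n)         # outcome without treatment
y1 = y0 + 2.0 * x                              # true CATE is tau(x) = 2x
y = np.where(t == 1, y1, y0)

# T-learner: fit one outcome model per arm (here, straight lines).
m1 = np.polyfit(x[t == 1], y[t == 1], deg=1)   # approximates E[Y | T=1, X=x]
m0 = np.polyfit(x[t == 0], y[t == 0], deg=1)   # approximates E[Y | T=0, X=x]

def tau_hat(x_new):
    """Estimated CATE: difference of the two fitted outcome models."""
    return np.polyval(m1, x_new) - np.polyval(m0, x_new)

print(tau_hat(np.array([0.0, 0.5, 1.0])))      # should be near 0, 1, 2
```

Swapping the two line fits for flexible regressors gives the usual ML version of the same idea. The identification argument is unchanged.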

A simple comparison E[Y | T=1] - E[Y | T=0] in observational data is NOT equal to ATE. The standard decomposition is E[Y | T=1] - E[Y | T=0] = ATT + selection bias, where selection bias = E[Y(0) | T=1] - E[Y(0) | T=0]: the difference between what would have happened without treatment in the treated group versus the untreated group. In an RCT this term is zero in expectation by construction, and ATT coincides with ATE.
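The decomposition follows in two steps from the link between observed and potential outcomes, Y = T Y(1) + (1 - T) Y(0). A short derivation, adding and subtracting the unobservable term E[Y(0) | T=1]:

```latex
\begin{aligned}
E[Y \mid T=1] - E[Y \mid T=0]
  &= E[Y(1) \mid T=1] - E[Y(0) \mid T=0] \\
  &= \underbrace{E[Y(1) \mid T=1] - E[Y(0) \mid T=1]}_{\text{ATT}}
   + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}
\end{aligned}
```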

A clinic prescribes the drug only to the sickest patients. Which statement is correct?

When treatment is non-randomly assigned (selection on observables or on prognosis), ATT and ATU diverge. This is not statistical error but real heterogeneity between groups. That is why a plain difference of means in observational data is not ATE.

Ignorability and overlap: when ATE is even estimable

In an RCT, randomization guarantees independence of treatment and potential outcomes. In observational data there are no guarantees - assumptions are required. The two main ones: ignorability (conditional independence) and overlap (positivity). Without them any ATE estimate is a mixture of causality and selection bias. Ignorability cannot be tested from the data themselves - it is a domain-knowledge hypothesis about the mechanism. Overlap, by contrast, can be checked empirically.

Ignorability (also called unconfoundedness or conditional independence; together with overlap it is known as strong ignorability) asserts: after conditioning on observed X, treatment T is statistically independent of the potential outcomes, (Y(0), Y(1)) ⊥ T | X. That is, all confounders are included in X. Under ignorability and overlap, ATE is identified as the difference of conditional expectations averaged over the distribution of X: ATE = E_X[ E[Y | T=1, X] - E[Y | T=0, X] ].

Overlap requires that within every stratum of X there is some probability of both receiving and not receiving treatment. If in stratum X = x all units are treated (P(T=1|X=x) = 1), then E[Y | T=0, X=x] is undefined and the conditional effect cannot be estimated. In practice overlap often breaks at the tails of the feature distribution.
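When both assumptions hold, averaging within-stratum contrasts over the distribution of X recovers ATE. The sketch below assumes a single binary confounder and made-up propensities (0.8 and 0.2) that stay strictly inside (0, 1). It illustrates the adjustment formula on simulated data, not a general-purpose estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical data: X is a single binary confounder (severe vs. mild).
x = (rng.random(n) < 0.5).astype(int)
e = np.where(x == 1, 0.8, 0.2)                 # propensity P(T=1 | X), strictly in (0, 1)
t = (rng.random(n) < e).astype(int)            # assignment depends on X only
y0 = 140.0 + 40.0 * x + rng.normal(0.0, 10.0, n)
y1 = y0 + np.where(x == 1, -20.0, -8.0)        # true ATE = -14
y = np.where(t == 1, y1, y0)

naive = y[t == 1].mean() - y[t == 0].mean()

# Adjustment formula: average the within-stratum contrasts over P(X = x).
adjusted = 0.0
for value in (0, 1):
    in_stratum = x == value
    contrast = y[in_stratum & (t == 1)].mean() - y[in_stratum & (t == 0)].mean()
    adjusted += in_stratum.mean() * contrast

print(f"naive={naive:.1f}  adjusted={adjusted:.1f}")
```

If either propensity were set to exactly 0 or 1, one of the within-stratum means would be taken over an empty group. That is the overlap failure described above.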

To verify (Y(0), Y(1)) ⊥ T | X from data, both potential outcomes would need to be observed - impossible by the fundamental problem. Therefore ignorability is always a substantive hypothesis about the selection mechanism, not a statistical conclusion. Sensitivity analysis (lesson 46) evaluates how robust the result is to hidden violations of ignorability.

When ignorability breaks

A hidden confounder and an illusion of effect

Estimating the effect of an MBA on salary. X includes age, parents' education, GPA. Ignorability seems to hold. But 'ambition' and 'networking' - unobserved - influence both the decision to get an MBA and future salary. Conditional independence is violated. A plain regression yields an estimate that actually mixes the MBA effect with the ambition effect. IV (lesson 43) or RDD (lesson 45) can rescue the analysis if a suitable instrument or threshold exists.
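A simulation makes the bias visible, because in simulated data the hidden variable can be switched on and off. All coefficients below (a true MBA effect of 5, an ambition effect of 8, and so on) are hypothetical, and `ols_coef_on_treatment` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

gpa = rng.normal(0.0, 1.0, n)        # observed covariate X
ambition = rng.normal(0.0, 1.0, n)   # hidden confounder, absent from real data

# Hypothetical mechanism: ambition drives both the MBA decision and salary.
mba = (0.5 * gpa + ambition + rng.normal(0.0, 1.0, n) > 0).astype(float)
salary = 50.0 + 5.0 * mba + 3.0 * gpa + 8.0 * ambition + rng.normal(0.0, 5.0, n)

def ols_coef_on_treatment(*covariates):
    """Least-squares coefficient on `mba`, controlling for the given covariates."""
    design = np.column_stack([np.ones(n), mba, *covariates])
    beta, *_ = np.linalg.lstsq(design, salary, rcond=None)
    return beta[1]

biased = ols_coef_on_treatment(gpa)             # what the analyst can actually run
oracle = ols_coef_on_treatment(gpa, ambition)   # only possible in a simulation

print(f"true effect=5.0  without ambition={biased:.1f}  with ambition={oracle:.1f}")
```

The regression that omits ambition overstates the MBA effect severalfold, and nothing in the observed columns reveals the problem.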

No ML model can decide on its own which variables to include in X. Choosing confounders is a substantive decision requiring DAG structure (Pearl) or mechanism knowledge (Rubin). High-capacity models like causal forests or DML give efficient ATE/CATE estimates, but only conditional on the analyst correctly specifying what goes into X. ML improves estimation, not identification.

Overlap can be checked empirically: plot the distribution of the propensity score e(X) = P(T=1|X) in the treated and control groups. If the two distributions barely share common support, overlap is broken. Trimming: drop units with e(X) close to 0 or 1. This changes the target population, sacrificing generalizability in exchange for a stable estimate.
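A minimal version of this check, assuming for illustration that the propensity score is known exactly (in practice it is estimated, for example with logistic regression). The steep propensity function and the thresholds 0.05 and 0.95 are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

x = rng.normal(0.0, 1.0, n)
# Assumption for illustration: the propensity score is known exactly.
# In practice e(X) is estimated, for example with logistic regression.
e = 1.0 / (1.0 + np.exp(-3.0 * x))
t = (rng.random(n) < e).astype(int)

# Compare the propensity distributions of the two groups on a common grid.
bins = np.linspace(0.0, 1.0, 11)
treated_hist, _ = np.histogram(e[t == 1], bins=bins)
control_hist, _ = np.histogram(e[t == 0], bins=bins)
print("treated:", treated_hist)
print("control:", control_hist)

# Trimming: keep only units whose propensity lies inside [low, high].
low, high = 0.05, 0.95               # hypothetical thresholds
keep = (e >= low) & (e <= high)
print(f"share of units dropped: {1.0 - keep.mean():.2f}")
```

The two histograms pile up at opposite ends of the grid. The share of dropped units shows how much of the population the trimmed estimate no longer describes.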

Why can ignorability not be tested from data alone?

Ignorability is an independence condition between T and POTENTIAL outcomes Y(0), Y(1). But by the fundamental problem, only one of the two is ever observed. So testing it directly from data is impossible - ignorability is always a substantive assumption about the selection mechanism.

Counterfactual reasoning and off-policy evaluation

Counterfactual reasoning - 'what would have happened if...' - is not a philosophical question but a computable quantity in the Rubin model. RCTs solve the task by randomization: if treatment is random, the control group is a valid counterfactual for the treatment group because E[Y(0) | T=1] = E[Y(0) | T=0]. In observational data the hypothetical counterfactual must be constructed through models and assumptions.

This equality is the heart of randomized trials. The left-hand side is unobservable (what would have happened to the treated without treatment). The right-hand side can be estimated directly from the data (what happened to the control group). Randomization makes them equal by construction, and the simple difference of mean outcomes gives ATE without any assumption on X.
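The equality can be checked numerically in a simulation, where Y(0) is visible for everyone. The sketch reuses the hypothetical hypertension population (baselines 180 and 140, effects -20 and -8) but assigns treatment by coin flip.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

severe = rng.random(n) < 0.5
y0 = np.where(severe, 180.0, 140.0) + rng.normal(0.0, 10.0, n)
y1 = y0 + np.where(severe, -20.0, -8.0)      # true ATE = -14

t = rng.integers(0, 2, n)                    # coin flip, independent of (y0, y1)
y = np.where(t == 1, y1, y0)

# Unobservable in real data, checkable here: baseline gap between the arms.
baseline_gap = y0[t == 1].mean() - y0[t == 0].mean()
diff_in_means = y[t == 1].mean() - y[t == 0].mean()

print(f"baseline gap={baseline_gap:.2f}  difference in means={diff_in_means:.2f}")
```

With random assignment the baseline gap is zero up to sampling noise, so the plain difference in means lands near the true ATE without using `severe` at all.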

In off-policy evaluation in RL the logic is identical: estimate the expected reward of a new policy pi' from logs collected under the old policy pi_0. The counterfactual - 'what if the agent had acted according to pi'?' - is estimated by importance sampling over the N logged triples (s_i, a_i, r_i): V_hat(pi') = (1/N) * sum_i [pi'(a_i|s_i) / pi_0(a_i|s_i)] * r_i.

ML connection: importance sampling in RL

Counterfactual evaluation of a new policy from old logs

A recommender system runs policy pi_0. Over a month, 100 million log records are collected: (state, action, reward). Data scientists trained a new policy pi' and want to estimate its expected conversion WITHOUT deploying it. Inverse propensity scoring: for every logged triple, reweight the reward by pi'(a|s) / pi_0(a|s). If pi' would have chosen the logged action more often than pi_0 did, the weight is above one. The mean of the reweighted rewards estimates V(pi'). This is a direct analog of inverse probability of treatment weighting (IPTW) in causal inference.
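The estimator fits in a few lines. The sketch below assumes a tiny contextual bandit (3 states, 2 actions, one-step rewards) with made-up policies and reward probabilities. A known simulator is what allows the estimate to be compared with the true V(pi').

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Hypothetical contextual bandit: 3 states, 2 actions. Rows are states.
pi_0 = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])      # logging policy pi_0(a|s)
pi_new = np.array([[0.2, 0.8], [0.5, 0.5], [0.6, 0.4]])    # candidate policy pi'(a|s)
p_reward = np.array([[0.1, 0.3], [0.2, 0.2], [0.4, 0.1]])  # P(reward = 1 | s, a)

# Logs collected under pi_0: (state, action, reward).
s = rng.integers(0, 3, n)
a = (rng.random(n) < pi_0[s, 1]).astype(int)
r = (rng.random(n) < p_reward[s, a]).astype(float)

# Inverse propensity scoring: reweight each logged reward by pi'(a|s) / pi_0(a|s).
w = pi_new[s, a] / pi_0[s, a]
v_ips = np.mean(w * r)

# Ground truth, available only because the simulator is known.
v_true = np.mean(np.sum(pi_new * p_reward, axis=1))
v_logged = r.mean()

print(f"logged value={v_logged:.3f}  IPS estimate={v_ips:.3f}  true V(pi')={v_true:.3f}")
```

The estimate needs the logged propensities pi_0(a|s) for every record. If the logging system did not store them, they must be estimated, which adds another source of error.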

When pi_0 and pi' differ sharply, importance weights can be huge and V(pi') estimates become very noisy. This is the analog of overlap violation in causal inference: if P(T=1|X) is close to 0 or 1, the weights 1/e(X) and 1/(1 - e(X)) explode. Stabilization methods - weight clipping, the self-normalized (SNIPS) estimator, doubly robust estimators - apply in both contexts.
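Two of these stabilizers are simple enough to sketch directly (doubly robust estimation needs a reward model and is left out). The function names and the four-record log are hypothetical; the weights include one deliberately large value.

```python
import numpy as np

def ips(w, r):
    """Plain importance-sampling estimate: mean of weighted rewards."""
    return np.mean(w * r)

def clipped_ips(w, r, max_weight):
    """Cap each weight at max_weight: lower variance, some bias."""
    return np.mean(np.minimum(w, max_weight) * r)

def snips(w, r):
    """Self-normalized IPS: divide by the sum of weights instead of N."""
    return np.sum(w * r) / np.sum(w)

# Hypothetical logs with one exploding weight; rewards lie in [0, 1].
w = np.array([0.5, 1.0, 10.0, 0.5])
r = np.array([1.0, 0.0, 1.0, 1.0])

print(ips(w, r))                 # 2.75, outside the possible reward range
print(clipped_ips(w, r, 5.0))    # 1.5
print(snips(w, r))               # 11/12, always within the reward range
```

Clipping trades variance for bias through the choice of `max_weight`. Self-normalization needs no tuning and keeps the estimate inside the range of observed rewards.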

CATE estimation is exactly uplift modeling, the central task of personalization. For each user, estimate tau(x) - how much extra purchase comes from showing the ad. Show only to those with tau(x) above threshold. Microsoft, Uber, Netflix use these methods to allocate promo codes, discounts, recommendations. Without a causal framing, uplift models confuse the intervention effect with baseline purchase propensity.

Counterfactual reasoning reshapes ML problem setup. Standard supervised learning estimates E[Y | X = x] - a correlational quantity. Causal ML estimates E[Y(1) - Y(0) | X = x] - what happens under intervention. The same data can be used for both tasks but they require different assumptions and different algorithms. A feature x = 'customer visited site' may correlate with purchase (correlation) without causing it (causation), if both depend on 'product interest'.

In off-policy RL evaluation, what plays the role of the propensity score?

Importance sampling reweights logs by pi'(a|s) / pi_0(a|s). The denominator - probability of the observed action under the data-collection policy - is the analog of e(X) in IPTW. When pi_0(a|s) is close to zero, weights explode. Same effect as positivity violation in causal inference.

Where the Rubin model leads

Potential outcomes are the foundation on which propensity score, IV, DiD, RDD, and sensitivity analysis are built. Each next method is a strategy for identifying Y(0), Y(1) under a different set of assumptions.

Confounders and Simpson's Paradox — The potential outcomes framework formalizes exactly when confounder adjustment gives valid causal estimates
Instrumental Variables — IV exploits exogenous variation to estimate LATE (Local Average Treatment Effect) within the potential outcomes framework

Key ideas

Each unit has two potential outcomes Y(1), Y(0); only one is observed
Individual effect tau_i = Y_i(1) - Y_i(0) is unobservable - the fundamental problem
ATE, ATT, ATU are subgroup averages; a gap between them signals non-random treatment assignment
CATE tau(x) is the conditional effect, basis of personalized interventions and uplift modeling
Ignorability and overlap make ATE identifiable from observational data
Ignorability is fundamentally untestable from data - always a substantive assumption
Off-policy evaluation in RL is direct application of counterfactual reasoning

Questions for reflection

Why is the individual causal effect tau_i never observed, yet ATE can be estimated?
In what real situations does SUTVA break, and which methods fix it?
How does CATE differ from ordinary supervised learning - both estimate a conditional expectation?
Which Rubin-model assumption corresponds to the boundedness of importance weights in off-policy RL?

Related lessons

stat-20-causal — Foundations of causality and confounding
stat-40-causal-rct — RCTs resolve the counterfactual problem
stat-42-causal-propensity — Propensity score builds on this model
stat-43-causal-iv — IV is an alternative when ignorability fails
prob-03-conditional