Statistics
Causal Inference: Interventions and Counterfactuals
How can correlation be distinguished from causation when randomized experiments are impossible, unethical, or too expensive?
- **Amazon pricing experiments:** 1200 A/B tests in 2021; naive regression without randomization overestimated the price effect 3x due to hidden confounders
- **Medicine:** estimating the effect of smoking on cancer, where randomizing exposure would be unethical; observational adjustment is valid only if the measured covariates capture the full confounder set
- **Economics:** Card (1995) used distance to nearest college as an IV to estimate returns to education free of ability confounding
- **Policy:** minimum wage effects on employment using natural experiments at state borders as quasi-randomization
Prerequisites
- Linear regression
- Conditional expectation
- Directed acyclic graphs
Causal inference rests on two formalisms: the Rubin Causal Model (potential outcomes framework) and Pearl's structural causal models (DAGs + do-calculus). Both answer the same question: what would have happened to Y if T had been set to t, with every other causal mechanism left unchanged?
Regression discontinuity design (RDD) exploits randomness around a cutoff threshold: if program assignment is determined by test score X > c, groups just above and below c are statistically similar - a local randomized experiment. The RDD estimate is tau_RDD = lim_{x→c+} E[Y|X=x] - lim_{x→c-} E[Y|X=x]. Applied to estimate the effect of minimum drinking age laws in the U.S. (cutoff at 21) on mortality.
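The limit formula for tau_RDD can be approximated by fitting one straight line on each side of the cutoff and comparing the two fitted values at x = c. The following is a minimal sketch on simulated data, not a production estimator: the bandwidth of 0.25, the data-generating process, and the helper names `fit_line`, `rdd_estimate`, and `simulate` are all assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rdd_estimate(xs, ys, cutoff, bandwidth):
    """Difference between the right and left local linear fits at the cutoff."""
    left = [(x - cutoff, y) for x, y in zip(xs, ys)
            if cutoff - bandwidth <= x <= cutoff]
    right = [(x - cutoff, y) for x, y in zip(xs, ys)
             if cutoff < x <= cutoff + bandwidth]
    # x is centred at the cutoff, so each intercept is the fitted value at x = c.
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left

def simulate(n=4000, tau=2.0, seed=0):
    """Hypothetical data: the outcome jumps by tau for units with x > 0."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 0.8 * x + tau * (x > 0.0) + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

xs, ys = simulate()
tau_hat = rdd_estimate(xs, ys, cutoff=0.0, bandwidth=0.25)
print(f"RDD estimate: {tau_hat:.2f} (simulated true jump: 2.0)")
```

In applied work the bandwidth is usually chosen by a data-driven rule and observations are kernel-weighted; the fixed window here only keeps the sketch short.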
Double/Debiased ML (Chernozhukov et al., 2018) combines causal inference with machine learning: nuisance functions (propensity score and conditional outcome) are estimated with flexible ML models on held-out folds (cross-fitting), then a deconfounded treatment effect estimate is built via the Robinson (1988) orthogonalization. Orthogonalization makes the estimate insensitive to first-order errors in the nuisance fits, and cross-fitting removes overfitting bias, so powerful nonlinear ML methods can be used for covariate control without contaminating the treatment effect estimate.
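A minimal sketch of the Robinson partialling-out idea with two-fold cross-fitting, under the assumption of a partially linear model Y = theta*T + g(X) + e with a single covariate X. A piecewise-constant binned-mean regression stands in for the flexible ML learner so that the example needs only the standard library; the function names and the simulated data are hypothetical.

```python
import math
import random

def binned_mean_learner(xs, ys, n_bins=20):
    """Piecewise-constant regression of y on x in [0, 1); returns a predictor."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    means = [s / c if c else overall for s, c in zip(sums, counts)]

    def predict(x):
        return means[min(int(x * n_bins), n_bins - 1)]
    return predict

def slope_through_origin(a, b):
    """Least-squares slope of b on a without an intercept."""
    return sum(u * v for u, v in zip(a, b)) / sum(u * u for u in a)

def dml_partialling_out(xs, ts, ys, n_folds=2):
    """Cross-fitted residual-on-residual estimate of theta."""
    n = len(xs)
    res_t, res_y = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        held_out = [i for i in range(n) if i % n_folds == k]
        # Nuisance functions E[T|X] and E[Y|X] are fitted on the other fold only.
        m_hat = binned_mean_learner([xs[i] for i in train], [ts[i] for i in train])
        l_hat = binned_mean_learner([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            res_t[i] = ts[i] - m_hat(xs[i])
            res_y[i] = ys[i] - l_hat(xs[i])
    return slope_through_origin(res_t, res_y)

def simulate(n=4000, theta=1.5, seed=0):
    """Hypothetical data with nonlinear confounding through X."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ts = [2.0 * x * x + rng.gauss(0.0, 1.0) for x in xs]
    ys = [theta * t + 3.0 * math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 1.0)
          for x, t in zip(xs, ts)]
    return xs, ts, ys

xs, ts, ys = simulate()
theta_hat = dml_partialling_out(xs, ts, ys)

# Naive regression of Y on T that ignores X, for comparison.
mt, my = sum(ts) / len(ts), sum(ys) / len(ys)
naive = slope_through_origin([t - mt for t in ts], [y - my for y in ys])
print(f"DML: {theta_hat:.2f}, naive OLS: {naive:.2f}, simulated truth: 1.5")
```

Swapping `binned_mean_learner` for a gradient-boosting or random-forest regressor would leave `dml_partialling_out` unchanged; only the nuisance fits differ.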
Synthetic control (Abadie et al., 2010) builds a counterfactual time series for a treated unit as a weighted average of untreated donor units. Weights w_j* minimize the distance between pre-treatment characteristics of the treated unit and the synthetic control. Applied to estimate the tobacco control law effect in California and the German reunification effect on GDP.
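The weight search can be illustrated in its simplest form, assuming only two donor units and matching on pre-treatment outcomes alone. The sketch below uses a grid search over a single weight w in [0, 1] in place of the constrained optimisation used in practice; all series are hypothetical numbers constructed so that the right answer is known.

```python
def synthetic_control_two_donors(treated_pre, donor_a_pre, donor_b_pre, grid=100):
    """Weight w on donor A (and 1 - w on donor B) that minimises the
    squared pre-treatment gap, found by grid search over [0, 1]."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(grid + 1):
        w = i / grid
        loss = sum((y - (w * a + (1.0 - w) * b)) ** 2
                   for y, a, b in zip(treated_pre, donor_a_pre, donor_b_pre))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w

# Hypothetical outcome series for two untreated donors over 8 periods.
donor_a = [10.0, 11.0, 12.5, 13.0, 14.0, 15.5, 16.0, 17.0]
donor_b = [20.0, 19.5, 19.0, 18.0, 18.5, 17.0, 16.5, 16.0]
n_pre = 5  # the intervention starts in period index 5

# The treated unit tracks 0.3*A + 0.7*B, then drops by 2.0 after the intervention.
treated = [0.3 * a + 0.7 * b for a, b in zip(donor_a, donor_b)]
treated = treated[:n_pre] + [y - 2.0 for y in treated[n_pre:]]

w = synthetic_control_two_donors(treated[:n_pre], donor_a[:n_pre], donor_b[:n_pre])
synthetic = [w * a + (1.0 - w) * b for a, b in zip(donor_a, donor_b)]
gaps = [y - s for y, s in zip(treated, synthetic)]
print(f"weight on donor A: {w:.2f}")
print("post-treatment gaps:", [round(g, 2) for g in gaps[n_pre:]])
```

With more donors the weights live on a simplex (non-negative, summing to one) and are found with a quadratic-programming solver; the post-treatment gap between the treated and synthetic series is read as the effect estimate.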
Pearl's ladder of causation (3 rungs): 1) Association - P(Y|X); 2) Intervention - P(Y|do(X)); 3) Counterfactuals - P(Y_x | X=x', Y=y). A randomized experiment directly accesses rung 2. Observational causal inference tries to climb from rung 1 to rung 2.
Propensity score matching reduces confounding without specifying a full outcome model: e(X) = P(T=1|X) is estimated by logistic regression or gradient boosting. The Rosenbaum-Rubin theorem (1983): if X suffices for identification (backdoor criterion), then e(X) also suffices. Matching on e(X) instead of the full vector X reduces the matching problem to one dimension while preserving covariate balance.
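The dimension reduction can be seen in a small simulation. This sketch assumes two binary covariates whose four cells map to only three distinct propensity values, estimates e(X) by cell frequencies (a stand-in for logistic regression or boosting), and uses stratification on the rounded score as a coarse form of matching. The data-generating process and all function names are hypothetical.

```python
import random
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def simulate(n=20000, seed=1):
    """Hypothetical data: true effect 2.0, treatment more likely when x1 + x2 is large."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.randint(0, 1), rng.randint(0, 1)
        e_true = 0.2 + 0.3 * (x1 + x2)  # 0.2, 0.5 or 0.8
        t = 1 if rng.random() < e_true else 0
        y = 2.0 * t + 3.0 * x1 + 1.0 * x2 + rng.gauss(0.0, 1.0)
        rows.append((x1, x2, t, y))
    return rows

def estimate_propensity(rows):
    """Frequency estimate of e(X) = P(T=1 | X) for each covariate cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [units, treated units]
    for x1, x2, t, _ in rows:
        counts[(x1, x2)][0] += 1
        counts[(x1, x2)][1] += t
    return {cell: k / n for cell, (n, k) in counts.items()}

def stratified_ate(rows, e_hat, digits=1):
    """Average the treated-minus-control gap over strata of the rounded score."""
    strata = defaultdict(lambda: ([], []))  # score -> (control ys, treated ys)
    for x1, x2, t, y in rows:
        strata[round(e_hat[(x1, x2)], digits)][t].append(y)
    n = len(rows)
    return sum((len(y0) + len(y1)) / n * (mean(y1) - mean(y0))
               for y0, y1 in strata.values())

rows = simulate()
e_hat = estimate_propensity(rows)
ate_hat = stratified_ate(rows, e_hat)
naive = (mean([y for _, _, t, y in rows if t == 1])
         - mean([y for _, _, t, y in rows if t == 0]))
print(f"stratified on e(X): {ate_hat:.2f}, naive difference: {naive:.2f}, truth: 2.0")
```

The cells (x1, x2) = (0, 1) and (1, 0) share one propensity stratum even though they affect the outcome differently; the estimate stays close to the simulated effect because, within a stratum, the covariate mix is the same for treated and control units.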
Sensitivity analysis for unmeasured confounding: Rosenbaum bounds quantify how much an unmeasured confounder would need to change the odds of treatment (by factor Gamma) to explain away the observed treatment effect. If the p-value remains significant for Gamma = 1.5, the conclusion is robust to a confounder that raises the odds of treatment by up to 50%. The E-value (VanderWeele & Ding, 2017) is a related measure: the minimum strength of association, on the risk-ratio scale, that a confounder would need to have with both treatment and outcome to fully explain away the observed association.
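For a point estimate expressed as a risk ratio RR, the E-value has the closed form RR + sqrt(RR * (RR - 1)), applied to 1/RR when RR is below 1. A short sketch follows; the function name `e_value` is chosen here for illustration.

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (closed form of VanderWeele & Ding, 2017)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1.0:
        rr = 1.0 / rr  # protective effects: work with the reciprocal
    return rr + math.sqrt(rr * (rr - 1.0))

for rr in (1.0, 1.5, 2.0, 0.5):
    print(f"RR = {rr}: E-value = {e_value(rr):.2f}")
```

An observed RR of 2.0 gives an E-value of about 3.41: a confounder would need risk-ratio associations of at least that size with both treatment and outcome to move the estimate to the null.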
Event study design tests for anticipation and traces post-treatment dynamics in panel data: estimate tau_l = E[Y_{t+l}(1) - Y_{t+l}(0)] for event-time l relative to treatment. Pre-treatment coefficients tau_{-k},...,tau_{-1} should be near zero (a pre-trend check that supports the parallel trends assumption); post-treatment coefficients trace the treatment effect over time. Callaway-Sant'Anna (2021) extends difference-in-differences (DiD) to staggered adoption, where units are treated at different times.
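In the simplest setting, with one treated group, one control group, and a single common treatment date, each event-study coefficient is a difference-in-differences against a reference period. The sketch below assumes that setting (not the staggered case) and uses a noise-free hypothetical panel so the coefficients can be checked by hand; the reference period l = -1 is a common but assumed choice.

```python
def event_study(panel, treated_units, event_time, ref=-1):
    """panel maps unit -> list of outcomes by period.
    Returns {l: tau_l}, each measured relative to event-time `ref`."""
    n_periods = len(next(iter(panel.values())))

    def gap(t):
        treated = [ys[t] for u, ys in panel.items() if u in treated_units]
        control = [ys[t] for u, ys in panel.items() if u not in treated_units]
        return sum(treated) / len(treated) - sum(control) / len(control)

    base = gap(event_time + ref)
    return {t - event_time: gap(t) - base for t in range(n_periods)}

# Hypothetical panel: unit fixed effect + common trend + an effect that grows
# by 1.0 per period for treated units from the event date onward.
n_units, n_periods, event_time = 6, 8, 4
treated_units = {0, 1, 2}
panel = {}
for u in range(n_units):
    series = []
    for t in range(n_periods):
        treated_now = u in treated_units and t >= event_time
        effect = float(t - event_time + 1) if treated_now else 0.0
        series.append(1.0 * u + 0.5 * t + effect)
    panel[u] = series

taus = event_study(panel, treated_units, event_time)
for l in sorted(taus):
    print(f"l = {l:+d}: tau = {taus[l]:.2f}")
```

The unit fixed effects and the common trend cancel in the double difference, so the pre-treatment coefficients are zero and the post-treatment ones recover the growing effect.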
Potential outcomes and ATE
Rubin's potential outcomes framework defines for each subject i a pair Y_i(1), Y_i(0), the outcome under treatment and under control. Only Y_i = T_i Y_i(1) + (1-T_i) Y_i(0) is observed: the other potential outcome is always counterfactual. The Average Treatment Effect ATE = E[Y(1) - Y(0)] is identified under ignorability T ⊥ (Y(0), Y(1)) | X together with overlap, 0 < P(T=1 | X) < 1, so that both treated and control units exist at every value of X.
ATT (effect on the treated) = E[Y(1) - Y(0) | T=1] often differs from ATE when there is selection into the program: treatment goes to those who potentially benefit more. ATE and ATT coincide when treatment assignment is unrelated to individual treatment effects, as under randomization, or when the effect is the same for everyone.
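Because a simulation can generate both potential outcomes for every unit, it can show what real data never reveal. The sketch below assumes a hypothetical population with heterogeneous gains and compares selection on gains with coin-flip assignment; all names and parameters are illustrative.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def simulate(n=10000, seed=2, randomize=False):
    """Each unit gets (Y(0), Y(1), T); the individual gain Y(1) - Y(0) varies."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        gain = rng.gauss(1.0, 1.0)
        if randomize:
            t = 1 if rng.random() < 0.5 else 0
        else:
            # Selection on gains: larger expected benefit makes treatment more likely.
            t = 1 if gain + rng.gauss(0.0, 1.0) > 1.0 else 0
        units.append((y0, y0 + gain, t))
    return units

def summarize(units):
    ate = mean(y1 - y0 for y0, y1, _ in units)
    att = mean(y1 - y0 for y0, y1, t in units if t == 1)
    # Only Y = T*Y(1) + (1-T)*Y(0) would be observed in real data.
    observed = [(t, t * y1 + (1 - t) * y0) for y0, y1, t in units]
    naive = (mean(y for t, y in observed if t == 1)
             - mean(y for t, y in observed if t == 0))
    return ate, att, naive

ate, att, naive = summarize(simulate())
ate_r, att_r, naive_r = summarize(simulate(randomize=True))
print(f"selection on gains: ATE={ate:.2f} ATT={att:.2f} naive={naive:.2f}")
print(f"randomized:         ATE={ate_r:.2f} ATT={att_r:.2f} naive={naive_r:.2f}")
```

In this setup selection depends only on the gain, not on Y(0), so the naive difference in observed means tracks ATT rather than ATE; under randomization all three quantities agree up to sampling noise.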
What is the fundamental problem of causal inference?
Holland (1986): 'No causation without manipulation, no causal inference without missing data.' We see Y_i = T_i Y_i(1) + (1-T_i) Y_i(0); the other half of each pair is always missing. Assumptions (ignorability, instruments, RDD) let us reconstruct the missing counterfactuals at the population level.
Do-calculus and backdoor criterion
Pearl's structural framework models causal relations through a directed acyclic graph (DAG). The do(T=t) operator denotes intervention: 'force T = t', cutting the incoming edges of T. P(Y | do(T)) is a causal quantity distinct from the observational P(Y | T).
Which condition is required for a valid backdoor set Z in a DAG?
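Backdoor criterion: (1) no node in Z is a descendant of T; (2) Z blocks (d-separates) every backdoor path between T and Y, that is, every path that begins with an arrow into T (T ← ... Y). Conditioning on a descendant of T can block part of the causal path (a mediator) or open a spurious path (a collider), which is why descendants are excluded. When both conditions hold, P(Y | do(T=t)) = Σ_z P(Y | T=t, Z=z) P(Z=z), which uses only observational quantities.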
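The adjustment step, averaging P(Y | T, Z) over the marginal distribution of Z, can be worked through on a small table. The counts below are hypothetical and assume a single binary confounder Z that satisfies the backdoor criterion.

```python
# counts[(z, t)] = (number of units, number of those with Y = 1); hypothetical data
counts = {
    (0, 0): (400, 80), (0, 1): (100, 30),
    (1, 0): (100, 60), (1, 1): (400, 280),
}

def p_y_given_t(t, counts):
    """Observational P(Y=1 | T=t), pooling over Z."""
    n = sum(n_cell for (_, tt), (n_cell, _) in counts.items() if tt == t)
    k = sum(k_cell for (_, tt), (_, k_cell) in counts.items() if tt == t)
    return k / n

def p_y_given_do(t, counts):
    """Backdoor adjustment: sum over z of P(Y=1 | T=t, Z=z) * P(Z=z)."""
    total = sum(n_cell for n_cell, _ in counts.values())
    result = 0.0
    for z in sorted({zz for zz, _ in counts}):
        n_z = sum(n_cell for (zz, _), (n_cell, _) in counts.items() if zz == z)
        n_cell, k_cell = counts[(z, t)]
        result += (k_cell / n_cell) * (n_z / total)
    return result

naive_effect = p_y_given_t(1, counts) - p_y_given_t(0, counts)
adjusted_effect = p_y_given_do(1, counts) - p_y_given_do(0, counts)
print(f"naive difference:    {naive_effect:.2f}")
print(f"adjusted difference: {adjusted_effect:.2f}")
```

Within each level of Z, treatment raises P(Y=1) by 0.10, and the adjusted estimate recovers that. The pooled comparison gives 0.34 because treated units are concentrated in the high-risk stratum Z = 1.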
Instrumental variables
An instrument Z is a variable that is correlated with T (relevance), affects Y only through T (exclusion restriction), and is independent of the unobserved confounders U (exogeneity). It identifies the causal effect even when T and Y share unobserved confounders. Classic example: judges with different strictness as an instrument for incarceration when studying its effect on recidivism.
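A weak instrument (cor(Z, T) ≈ 0) makes the IV estimator unreliable: the denominator Cov(T, Z) is near zero, so the estimate has huge variance and, in finite samples, is biased toward the confounded OLS estimate. Rule of thumb (Staiger & Stock, 1997; formalized with critical values by Stock & Yogo, 2005): the first-stage F-statistic should exceed 10.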
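A small simulation can put the ratio estimator Cov(Y, Z) / Cov(T, Z) next to the first-stage F-statistic. The sketch assumes a single continuous instrument, a linear model with true effect 1.0, and an unobserved confounder U; the coefficients and function names are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def simulate(n=5000, seed=3):
    """Z shifts T; U confounds T and Y and is not available to the analyst."""
    rng = random.Random(seed)
    z, t, y = [], [], []
    for _ in range(n):
        zi, u = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        ti = 0.8 * zi + u + rng.gauss(0.0, 1.0)
        yi = 1.0 * ti + 2.0 * u + rng.gauss(0.0, 1.0)
        z.append(zi)
        t.append(ti)
        y.append(yi)
    return z, t, y

def iv_estimate(z, t, y):
    return cov(y, z) / cov(t, z)

def ols_slope(t, y):
    return cov(y, t) / cov(t, t)

def first_stage_f(z, t):
    """F-statistic of the first-stage regression of T on Z (single instrument: F = t^2)."""
    n = len(z)
    b = cov(t, z) / cov(z, z)
    a = mean(t) - b * mean(z)
    rss = sum((ti - a - b * zi) ** 2 for zi, ti in zip(z, t))
    var_b = (rss / (n - 2)) / (cov(z, z) * (n - 1))
    return b * b / var_b

z, t, y = simulate()
iv_hat, ols_hat, f_stat = iv_estimate(z, t, y), ols_slope(t, y), first_stage_f(z, t)
print(f"IV: {iv_hat:.2f}, OLS: {ols_hat:.2f}, first-stage F: {f_stat:.0f}")
```

Lowering the first-stage coefficient 0.8 toward zero in `simulate` shrinks the F-statistic and makes the IV estimate swing widely from seed to seed, which is the weak-instrument problem in miniature.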
Which condition is violated if the instrument Z has a direct effect on the outcome Y beyond T?
Exclusion restriction is the core of IV identification: the only allowed Z → Y channel is through T. With a direct Z → Y effect the ratio β_YZ / β_TZ mixes the causal T → Y effect with the direct Z → Y effect and becomes biased. The assumption is untestable from data and must be justified from domain theory.
Causal inference and related fields
Causal inference bridges statistics, econometrics, epidemiology, and machine learning through the shared task of effect identification.
- Econometrics — Historical source of IV, DiD, and RDD as natural-experiment designs
- Causal ML — Double ML and causal forests use flexible ML to estimate nuisance functions before orthogonalization
- Epidemiology — Confounder stratification and propensity scores grew out of cohort-study practice
Summary
- ATE = E[Y(1) - Y(0)]; identified under randomization, where T is independent of (Y(0), Y(1)), or under ignorability given X
- Backdoor criterion: set Z contains no descendants of T and blocks all backdoor (noncausal) paths T ← ... → Y; adjustment by Z gives P(Y|do(T))
- IV estimate = Cov(Y,Z)/Cov(T,Z); instrument must be relevant, excluded, and independent of confounders
- LATE (local average treatment effect): the treatment effect for compliers, which is what IV identifies; differs from ATE when effects are heterogeneous
- DiD (difference-in-differences) removes time-invariant group effects and common time effects under the parallel trends assumption