Causal Calculus

Double ML and CATE: Causal Forests

2013: Amazon tests pricing personalization across 50 million users. The average effect of a 10% discount is +2.3% conversion. But for users over 45 with a Premium subscription it is +8.1%, and for new mobile users it is -0.4%. The ATE of 2.3% hides this heterogeneity. Teams that decide on ATE alone lose money on every new mobile user and leave revenue on the table in the Premium segment.

  • **Personalized medicine:** EGFR inhibitors such as gefitinib in lung cancer give ATE = +2.1 months survival. But for patients with an EGFR mutation CATE = +8.3 months; for others CATE = -0.4 months. FDA approval restricted to EGFR+ patients followed from this kind of subgroup (CATE) analysis (IPASS trial, 2009). Without heterogeneity analysis the drug might never have been approved.
  • **Employment policy:** a retraining program costs $3,000 per person. ATE on income one year later = +$800 (negative ROI). But CATE for participants aged 25-35 without a college degree = +$4,200 (positive ROI). The German government uses this type of analysis to target programs - replacing universal subsidies with precision allocation.
  • **A/B testing and uplift modeling:** streaming services (Netflix, Spotify) estimate CATE for subscription offers. Price elasticity differs 3-5x between students and families. Double ML and its heterogeneous extensions (R-learner, causal forests) produce debiased effect estimates from observational data, provided ignorability holds - without running a separate A/B test for every segment.

Prerequisites

  • Potential outcomes: Y(0), Y(1), ATE = E[Y(1)-Y(0)]
  • Ignorability (unconfoundedness): (Y(0), Y(1)) ⊥ D | X
  • Propensity score: e(x) = P(D=1|X=x)
  • Regularization in ML (Lasso, Random Forest) and its biases
  • K-fold cross-validation
  • Causal Discovery: PC, FCI, NOTEARS
  • DAGs and potential outcomes
  • Counterfactuals and interventions

Conditional Average Treatment Effect (CATE)

**ATE (Average Treatment Effect)** is the population-wide average E[Y(1) - Y(0)]. It answers: 'Does the treatment work on average?' But averages hide important information. **CATE** is finer-grained: tau(x) = E[Y(1) - Y(0) | X = x] - the effect for a specific patient (or segment) with characteristics x. The gap between ATE and CATE is the gap between 'aspirin reduces fever' and 'aspirin reduces fever by 1.2 degrees in adult males with a temperature above 38.5 C'.

**Potential outcomes (Neyman-Rubin framework):** every unit has two potential outcomes - Y(0) under control and Y(1) under treatment. Only one is observed: Y = D*Y(1) + (1-D)*Y(0), where D is the treatment indicator. The fundamental problem of causal inference: both outcomes cannot be observed simultaneously for the same unit.
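The observation rule Y = D*Y(1) + (1-D)*Y(0) and the gap between ATE and CATE can be illustrated with a small simulation. This is only a sketch under assumed numbers: the binary segment `x`, the effect sizes 8.0 and -0.5, and the helper `diff_in_means` are all hypothetical, and treatment is randomized so that ignorability holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical covariate: 1 = "premium" segment, 0 = everyone else.
x = rng.binomial(1, 0.3, size=n)

# Both potential outcomes are visible only inside a simulation.
tau = np.where(x == 1, 8.0, -0.5)          # assumed true CATE per segment
y0 = 1.0 + 2.0 * x + rng.normal(size=n)    # outcome under control
y1 = y0 + tau                              # outcome under treatment

# Randomized treatment, so ignorability holds by construction.
d = rng.binomial(1, 0.5, size=n)
y = d * y1 + (1 - d) * y0                  # the only outcome we observe

def diff_in_means(y, d, mask):
    """Treated-minus-control mean outcome within the rows selected by mask."""
    return y[mask & (d == 1)].mean() - y[mask & (d == 0)].mean()

everyone = np.ones(n, dtype=bool)
ate_hat = diff_in_means(y, d, everyone)
cate_hat = {seg: diff_in_means(y, d, x == seg) for seg in (0, 1)}

print(f"ATE estimate:       {ate_hat:.2f}")
print(f"CATE estimate, x=0: {cate_hat[0]:.2f}")
print(f"CATE estimate, x=1: {cate_hat[1]:.2f}")
```

With these assumed numbers the true ATE is 0.3 * 8.0 + 0.7 * (-0.5) = 2.05, a positive average that conceals a negative effect for 70% of units.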

Effect heterogeneity is the norm, not the exception. In medicine, the same drug saves some patients and harms others depending on genotype. In economics, subsidies work for small businesses and are useless for large ones. In online services, a discount converts new users but erodes margins with loyal ones. Ignoring CATE means making averaged decisions when personalized ones are needed.

**Overlap trap:** if the propensity score P(D=1|X=x) is near 0 or 1 for some values of x, CATE estimates in that region become extremely unstable. No algorithm can reliably estimate CATE where the counterfactual is never observed. Always check overlap before interpreting results.
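A minimal overlap check might look like the sketch below. It assumes a single covariate and estimates the propensity score crudely as the treated share within decile bins; in practice a calibrated classifier would usually take that role. The data-generating process and the 0.05/0.95 cutoffs are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical single covariate; treatment probability rises steeply with x,
# so units with extreme x are almost never (or almost always) treated.
x = rng.normal(size=n)
true_e = 1.0 / (1.0 + np.exp(-3.0 * x))
d = rng.binomial(1, true_e)

# Crude propensity estimate: share of treated units within decile bins of x.
edges = np.quantile(x, np.linspace(0, 1, 11))
bin_id = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, 9)
e_hat_by_bin = np.array([d[bin_id == b].mean() for b in range(10)])
e_hat = e_hat_by_bin[bin_id]

# Flag regions where one arm is nearly absent; the cutoffs are assumed.
lo, hi = 0.05, 0.95
poor_overlap = (e_hat < lo) | (e_hat > hi)

for b, e in enumerate(e_hat_by_bin):
    flag = "  <- poor overlap" if (e < lo or e > hi) else ""
    print(f"decile {b}: e_hat = {e:.3f}{flag}")
print(f"share of units outside [{lo}, {hi}]: {poor_overlap.mean():.1%}")
```

CATE estimates for the flagged deciles would rest almost entirely on extrapolation, so one common response is to report results only for the region with adequate overlap.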

A study finds ATE of a new drug = +3 points on a health scale. A physician wants to prescribe it to all patients. What is the main argument against this?

Double/Debiased ML (Chernozhukov et al.)

In 2018, Victor Chernozhukov (MIT) and co-authors published 'Double/Debiased Machine Learning for Treatment and Structural Parameters' in The Econometrics Journal. The central problem: when ML (Lasso, random forests) is used to control for confounders, regularization and overfitting bias the causal effect estimate, and the bias can be of the same order as the effect itself. Double ML fixes this through a two-step partialling-out procedure: first predict Y and D from X and take residuals, then regress the Y-residuals on the D-residuals.

**Cross-fitting is the key mechanism.** Without it, the nuisance models are trained on the same data used to form residuals, so overfitting creates a correlation between the fitted values and the errors - this is overfitting bias, which is distinct from the regularization bias (from L1 in Lasso, L2 in ridge, or limited tree depth) that Neyman orthogonality handles. Cross-fitting uses K folds: for each fold, nuisance models train on the other K-1 folds and predict on the held-out fold, so every observation gets an out-of-sample prediction and none is wasted. This breaks the spurious correlation and restores root-n consistency of the estimate of theta.
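The cross-fitted partialling-out procedure for the partially linear model Y = theta*D + g(X) + U can be sketched as follows. This is an illustrative sketch, not the authors' reference implementation: ridge regression on polynomial features stands in for "any ML" learner, and the names `poly_features`, `ridge_fit_predict` and `double_ml_plr`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def poly_features(X):
    """Intercept, linear and squared terms: a small stand-in for a flexible learner."""
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

def ridge_fit_predict(X_train, t_train, X_test, alpha=1.0):
    """Fit ridge regression on the training rows, predict on the test rows."""
    F_train, F_test = poly_features(X_train), poly_features(X_test)
    A = F_train.T @ F_train + alpha * np.eye(F_train.shape[1])
    coef = np.linalg.solve(A, F_train.T @ t_train)
    return F_test @ coef

def double_ml_plr(X, d, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(X) + U."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    y_res, d_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Nuisance models are fit on the other folds; residuals are out-of-fold.
        y_res[test] = y[test] - ridge_fit_predict(X[train], y[train], X[test])
        d_res[test] = d[test] - ridge_fit_predict(X[train], d[train], X[test])
    theta = (d_res @ y_res) / (d_res @ d_res)    # OLS of Y-residual on D-residual
    psi = (y_res - theta * d_res) * d_res        # orthogonal score per unit
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se

# Simulated data with an assumed true effect of 1.5 and confounding through X.
rng = np.random.default_rng(42)
n, p, theta_true = 5_000, 5, 1.5
X = rng.normal(size=(n, p))
d = 0.8 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)
y = theta_true * d + X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)

naive = np.cov(d, y)[0, 1] / np.var(d, ddof=1)   # slope of Y on D, ignoring X
theta_hat, se = double_ml_plr(X, d, y)
print(f"naive slope:  {naive:.3f}")
print(f"Double ML:    {theta_hat:.3f} (se {se:.3f})")
```

The naive slope absorbs the confounding through X, while the residual-on-residual estimate should land near the assumed true value. Libraries such as DoubleML and EconML provide maintained implementations of the same idea.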

**Neyman orthogonality** is the mathematical foundation. The moment function psi(theta, eta) for estimating theta must have a zero derivative with respect to the nuisance parameters eta at their true values: d/d_eta E[psi(theta_0, eta_0)] = 0. This means small errors in nuisance estimation produce only second-order error in theta - error in theta = O(||eta_hat - eta_0||^2), not O(||eta_hat - eta_0||). This is why flexible ML can be used for the nuisance functions without compromising the estimate of theta, provided the nuisance estimates converge fast enough (roughly faster than n^(-1/4)).
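For the partially linear model the orthogonality condition can be checked by hand. The derivation below is a sketch in assumed notation (g_0, m_0, ell_0, U, V and the perturbation directions delta are introduced here for illustration); it contrasts the partialling-out score with a naive score that is not orthogonal.

```latex
% Partially linear model (notation assumed for this sketch)
Y = \theta_0 D + g_0(X) + U, \quad \mathbb{E}[U \mid X, D] = 0, \qquad
D = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0 .

% Orthogonal (partialling-out) score with nuisance \eta = (\ell, m),
% where \ell_0(X) = \mathbb{E}[Y \mid X] = \theta_0 m_0(X) + g_0(X)
\psi(W; \theta, \eta) = \bigl(Y - \ell(X) - \theta\,(D - m(X))\bigr)\,\bigl(D - m(X)\bigr).

% At the truth, Y - \ell_0(X) - \theta_0 (D - m_0(X)) = U and D - m_0(X) = V.
% Perturb the nuisance along (\delta_\ell, \delta_m) and differentiate at r = 0:
\frac{\partial}{\partial r}\,
\mathbb{E}\bigl[\psi(W; \theta_0, \eta_0 + r\delta)\bigr]\Big|_{r=0}
= \mathbb{E}\bigl[(-\delta_\ell(X) + \theta_0\,\delta_m(X))\,V\bigr]
  - \mathbb{E}\bigl[U\,\delta_m(X)\bigr] = 0 .

% Non-orthogonal score for comparison
\psi_{\text{naive}}(W; \theta, g) = \bigl(Y - \theta D - g(X)\bigr)\,D, \qquad
\frac{\partial}{\partial r}\,
\mathbb{E}\bigl[\psi_{\text{naive}}(W; \theta_0, g_0 + r\delta_g)\bigr]\Big|_{r=0}
= -\mathbb{E}\bigl[\delta_g(X)\,m_0(X)\bigr] \neq 0 \ \text{in general}.
```

Both expectations in the first derivative vanish because V and U have zero conditional mean given X. In the naive score, an error in g passes through to theta at first order.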

**Double ML estimates a homogeneous effect** (one scalar theta), not CATE. For heterogeneous effects, the R-learner extension or causal forests are required. Chernozhukov et al. proposed Partially Linear Regression (PLR) and interactive IV models, but the base algorithm targets a scalar theta.

Why does Double ML use cross-fitting (K-fold) rather than a simple train/test split?

Causal forests and meta-learners

2018: Stefan Wager and Susan Athey publish 'Estimation and Inference of Heterogeneous Treatment Effects using Random Forests' in JASA. Causal forests are a special case of Generalized Random Forests (GRF) in which each tree chooses splits not to minimize prediction MSE but to maximize the difference in estimated treatment effects between the resulting leaves. Instead of grouping units with similar outcomes Y, trees search for splits where units are 'similar in tau(X)'.

**Honest trees - the core idea.** An ordinary decision tree builds its structure (splits) and estimates leaf values on the same data, which biases leaf estimates. Honest trees split the sample: one half determines the tree structure (which splits to make), the other half estimates CATE in each leaf. Combined with subsampling across trees, this is what makes asymptotically valid confidence intervals for tau(x) possible.
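The effect of honesty can be seen in a toy experiment with a single-split tree. This is a deliberately simplified sketch, not the GRF splitting rule: the split criterion (largest gap in leaf-level difference in means), the candidate grid, and the helpers `leaf_effect` and `best_split` are assumptions made for illustration. The true effect is constant, so any gap between leaves is noise.

```python
import numpy as np

def leaf_effect(y, d, mask):
    """Difference in mean outcomes (treated minus control) inside one leaf."""
    return y[mask & (d == 1)].mean() - y[mask & (d == 0)].mean()

def best_split(x, y, d, candidates):
    """Pick the threshold whose two leaves differ most in estimated effect."""
    gaps = [abs(leaf_effect(y, d, x <= c) - leaf_effect(y, d, x > c))
            for c in candidates]
    best = int(np.argmax(gaps))
    return candidates[best], gaps[best]

rng = np.random.default_rng(7)
n, n_reps = 400, 200
candidates = np.linspace(-0.8, 0.8, 17)     # assumed grid of split points
adaptive_gaps, honest_gaps = [], []

for _ in range(n_reps):
    x = rng.normal(size=n)
    d = rng.binomial(1, 0.5, size=n)
    y = 1.0 * d + rng.normal(size=n)        # true effect is 1.0 everywhere

    half = n // 2
    s, e = slice(0, half), slice(half, n)   # structure half, estimation half

    # Adaptive: the same half both chooses the split and estimates leaf effects.
    c, gap_same_data = best_split(x[s], y[s], d[s], candidates)
    adaptive_gaps.append(gap_same_data)

    # Honest: the split comes from the structure half, effects from the other.
    gap_fresh_data = abs(leaf_effect(y[e], d[e], x[e] <= c)
                         - leaf_effect(y[e], d[e], x[e] > c))
    honest_gaps.append(gap_fresh_data)

print(f"mean apparent leaf gap, adaptive: {np.mean(adaptive_gaps):.3f}")
print(f"mean apparent leaf gap, honest:   {np.mean(honest_gaps):.3f}")
```

Because the adaptive estimate reuses the data that selected the split, it tends to report heterogeneity that is not there. The honest estimate is still noisy, but it is not inflated by the search over splits.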

**Choosing a meta-learner in practice:** S-learner (one model with treatment as a feature) is simple, but regularization often shrinks the treatment effect estimate. T-learner (separate models for treated and control) works well when both groups are large enough to fit reliably. X-learner tends to win with highly imbalanced groups (treated << control or vice versa). R-learner (to which causal forests with local centering are closely related) has the strongest theoretical guarantees: orthogonality to nuisance errors gives debiasing analogous to Double ML. As a rough rule of thumb, differences diminish on large datasets (>50k observations); on small ones, X and R-learners tend to outperform T and S.
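The shrinkage problem of the S-learner can be made concrete with a small sketch. Everything here is assumed for illustration: the linear true CATE 2.0 + x, randomized treatment, the helpers `ridge` and `design`, and above all the penalty `alpha = 500`, which is deliberately heavy so that the effect is visible.

```python
import numpy as np

def ridge(F, t, alpha):
    """Ridge coefficients; column 0 is the intercept and is left unpenalized."""
    penalty = alpha * np.eye(F.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(F.T @ F + penalty, F.T @ t)

def design(x):
    """Design matrix with an intercept and one covariate."""
    return np.column_stack([np.ones(len(x)), x])

rng = np.random.default_rng(3)
n = 4_000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, size=n)            # randomized, so ignorability holds
tau_true = 2.0 + 1.0 * x                    # assumed true CATE
y = x + tau_true * d + rng.normal(size=n)

alpha = 500.0   # deliberately heavy penalty so the shrinkage is visible

# S-learner: one model in which the treatment is just another feature.
F_s = np.column_stack([np.ones(n), x, d, d * x])
b_s = ridge(F_s, y, alpha)
tau_s = b_s[2] + b_s[3] * x                 # mu(x, 1) - mu(x, 0)

# T-learner: separate models for treated and control units.
b1 = ridge(design(x[d == 1]), y[d == 1], alpha)
b0 = ridge(design(x[d == 0]), y[d == 0], alpha)
tau_t = design(x) @ (b1 - b0)

print(f"true mean effect:      {tau_true.mean():.2f}")
print(f"S-learner mean effect: {tau_s.mean():.2f}")
print(f"T-learner mean effect: {tau_t.mean():.2f}")
```

In the S-learner the penalty acts directly on the treatment coefficient, pulling the average effect toward zero. In the T-learner the average effect sits in the unpenalized intercepts of the two models, although the slope of tau(x) is still shrunk. Neither learner corrects for confounding in this sketch, since treatment is randomized by construction.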

**Causal forests do not identify causal structure.** They assume the causal graph (or at minimum the set of confounders X) is already known and ignorability holds. Causal forests solve the task of estimating tau(x), not the task of selecting the correct X for conditional exogeneity.

**Misconception:** causal forests discover the causal structure of the data - which variables are causes and which are effects.

**Correction:** causal forests estimate the heterogeneous treatment effect tau(x) given a known causal structure. They assume confounders X are already identified and ignorability holds.

The word 'causal' in the name means the method estimates a causal effect rather than a mere prediction, not that it discovers causal relationships. Identifying the causal graph is the job of causal discovery methods (PC, FCI, NOTEARS from the previous lesson).

What is the key difference between 'honest trees' in causal forests and ordinary decision trees?

Key ideas

  • **CATE tau(x) = E[Y(1)-Y(0)|X=x]** - the heterogeneous treatment effect personalized to subgroup x. ATE is the special case: averaging tau(x) over the distribution of X. In real data, effect heterogeneity is the rule, so ATE alone is rarely enough for decisions.
  • **Double ML (Chernozhukov et al., 2018):** two-step debiasing - predict Y and D from X using any ML with cross-fitting, then run OLS on the residuals. Neyman orthogonality guarantees nuisance errors produce only second-order contamination of theta. Targets a scalar effect; heterogeneous tau(x) requires extensions.
  • **Causal forests (Wager & Athey 2018) and meta-learners:** causal forests are GRF with a CATE-variation criterion and honest trees for valid confidence intervals. Meta-learners (T/S/X/R) are flexible wrappers around any ML model. R-learner has the strongest theoretical guarantees. All methods assume a known causal structure and ignorability: they estimate CATE, they do not discover causes.

Related topics

Double ML and causal forests build on the identification framework and connect to several topics in the course:

  • DAGs and potential outcomes — Ignorability in CATE is the backdoor criterion from DAG theory. Choosing the correct set X for conditional effects requires knowing the graph.
  • Backdoor criterion — The set X in CATE must block all backdoor paths D <- ... -> Y. Causal forests do not select X automatically - that is an identification problem.
  • Counterfactuals and structural causal models — CATE is the counterfactual difference E[Y_do(D=1) - Y_do(D=0) | X=x]. SCMs provide the formal grounding for potential outcomes.

Questions for reflection

  • Double ML requires ignorability: all confounders must be included in X. How can one check in practice that no important confounder was omitted - and what happens to the CATE estimate when this assumption is violated?
  • R-learner and causal forests are orthogonal to nuisance errors - does this mean that with enough data they give the correct answer regardless of nuisance model quality? Are there situations where this breaks down?
  • Suppose causal forests show that the treatment effect for group A is significantly positive and for group B significantly negative. How does one make a management decision accounting for confidence intervals, tree honesty, and potential multiple hypothesis testing?

Related lessons

  • stat-01-sampling