Statistics

Propensity Score: Matching and Inverse Probability Weighting

Lesson objectives

Write down the propensity score e(X) = P(T=1|X) and the balancing property
Apply matching by propensity score (1:1, caliper, k-NN)
Compute ATE via IPTW and recognize positivity violations
Understand the doubly robust estimator and the DML framework
Connect propensity score to unbiased offline evaluation in ML

Prerequisites

Potential outcomes model Y(1), Y(0)
Ignorability and overlap
Logistic regression and gradient boosting

A pharmaceutical company compares two cholesterol drugs, but no RCT was ever run. Drug A was prescribed to younger, healthier patients, drug B to older, sicker ones, so a direct comparison of outcomes is meaningless. In 1983 Rosenbaum and Rubin proved a theorem: all confounding by observed covariates can be controlled by conditioning on ONE number, the probability of receiving treatment. That number is the propensity score. Observational drug studies, tech marketing campaigns, and uplift models all rest on the propensity score introduced in that single 1983 paper.

**Pharma and epidemiology**: foundation of observational drug evaluation, thousands of JAMA and NEJM papers
**Tech**: Uber, Meta, Airbnb use propensity scores to estimate feature effects when an A/B test is not available
**Marketing and uplift modeling**: matching and weighting underlie personalized promotions
**Recommender systems**: unbiased offline evaluation (Schnabel et al. 2016) applies inverse propensity scoring (IPS)
**Policy evaluation**: government program effects on CPS, ACS, and other registries

The theorem that changed everything

In 1983 Paul Rosenbaum and Donald Rubin published 'The Central Role of the Propensity Score in Observational Studies for Causal Effects' in Biometrika. Before them, matching had to be done simultaneously on dozens of covariates, which is infeasible in practice and statistically noisy (curse of dimensionality): close matches on many covariates at once rarely exist. Their theorem collapsed everything to one scalar. For decades the propensity score was estimated by logistic regression. In the 2000s and 2010s McCaffrey, Lee and others showed the advantages of gradient boosting for high-dimensional X. In 2018 Chernozhukov and coauthors in 'Double/Debiased Machine Learning' gave the theoretical foundation for using arbitrary ML models via cross-fitting and orthogonal moments, which connected the propensity score with modern ML.

e(X) = P(T=1|X): one number that balances everything

A pharmaceutical company wants to compare two cholesterol drugs, but no RCT was ever run. Drug A was prescribed to younger, healthier patients, drug B to older, sicker ones. A direct outcome comparison is meaningless: B looks 'worse' simply because it was given to patients who started out worse. In 1983 Paul Rosenbaum and Donald Rubin published a single theorem: if there are no unobserved confounders, all confounding can be controlled by conditioning on one number, the probability of receiving treatment given pre-treatment characteristics. This scalar is the propensity score. The theorem transformed observational research.

The key property of the propensity score is the balancing property: within strata of equal e(X) the covariates X are statistically independent of treatment T. Two patients with the same e(X) may have different age and history, but on average these features balance out between treated and control. As a consequence, if ignorability holds given the entire vector X, it also holds given the scalar e(X): for removing confounding, conditioning on e(X) is as good as conditioning on X.
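The balancing property can be checked on simulated data. The sketch below is illustrative only: the two covariates (age, LDL), the logistic form of e(X), and the sample size are assumptions made for the demo, not values from the lesson. It compares the raw treated-versus-control gap in mean age with the average gap inside ten strata of e(X).

```python
import math
import random

random.seed(0)  # fixed seed so the demo is reproducible

def true_propensity(age, ldl):
    # Assumed treatment rule: younger patients with lower LDL are treated more often.
    index = -(age - 60) / 10 - (ldl - 130) / 30
    return 1 / (1 + math.exp(-index))

# Simulated patients stored as (age, e(X), T).
patients = []
for _ in range(20000):
    age = random.uniform(40, 80)
    ldl = random.uniform(100, 160)
    e = true_propensity(age, ldl)
    t = 1 if random.random() < e else 0
    patients.append((age, e, t))

def mean_age_gap(rows):
    """Mean age of treated minus mean age of controls (None if a group is empty)."""
    treated = [age for age, _, t in rows if t == 1]
    control = [age for age, _, t in rows if t == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)

raw_gap = mean_age_gap(patients)

# Stratify on e(X) in ten bins of width 0.1, then average the absolute
# within-stratum gaps, weighting each stratum by its size.
strata = {}
for row in patients:
    strata.setdefault(min(int(row[1] * 10), 9), []).append(row)

weighted_sum, used = 0.0, 0
for rows in strata.values():
    gap = mean_age_gap(rows)
    if gap is not None:
        weighted_sum += abs(gap) * len(rows)
        used += len(rows)
stratified_gap = weighted_sum / used

print(f"raw age gap: {raw_gap:.2f} years")
print(f"mean |gap| within propensity strata: {stratified_gap:.2f} years")
```

The within-stratum gap is not exactly zero because each bin still spans a range of e(X) values and the sample is finite; narrower bins or exact conditioning shrink it further.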

In practice e(X) is unknown and is estimated from data. Classical choice: logistic regression. Modern practice: gradient boosting or random forests, which better capture nonlinearities and interactions in high-dimensional X.

In medical registries X can contain hundreds of features: lab values, ICD-10 diagnoses, drug interactions. Logistic regression assumes linearity in log-odds, which is almost never true. XGBoost and LightGBM automatically capture interactions and nonlinearities, giving a more accurate e(X). Lee, Lessler, Stuart (2010) showed in a simulation study that boosted-tree propensity scores give lower bias and better covariate balance than logistic regression when the true treatment model is nonlinear or non-additive.

The goal of estimating e(X) is balance, not prediction of T. A model with AUC = 0.99 is actually bad: it separates the groups too well, and overlap is violated. A good propensity score gives moderate AUC (0.6-0.85) and produces balanced covariates after weighting. Diagnostics: after weighting/matching, check standardized mean differences (SMD) - they should be below 0.1.
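The SMD diagnostic mentioned above can be sketched in a few lines. This is a minimal version under stated assumptions: the pooled standard deviation is taken as the square root of the average of the two group variances, variances use the population (weight-normalized) form, and the optional `weights` argument lets the same function be reused after weighting. The function name and the toy ages are hypothetical.

```python
import math

def _weighted_mean_var(values, weights):
    """Weighted mean and weighted (population-form) variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var

def smd(x, t, weights=None):
    """Standardized mean difference of covariate x: treated (t=1) minus control (t=0)."""
    if weights is None:
        weights = [1.0] * len(x)
    stats = {}
    for arm in (1, 0):
        vals = [v for v, ti in zip(x, t) if ti == arm]
        wts = [w for w, ti in zip(weights, t) if ti == arm]
        stats[arm] = _weighted_mean_var(vals, wts)
    (m1, v1), (m0, v0) = stats[1], stats[0]
    return (m1 - m0) / math.sqrt((v1 + v0) / 2)

age = [50, 54, 66, 70]
treated = [1, 1, 0, 0]
print(smd(age, treated))  # -8.0: means 52 vs 68, pooled SD 2
```

In practice the function would be called once per covariate, before and after matching or weighting, and the absolute values compared with the 0.1 threshold.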

Numbers from a real clinic

Covariate balancing through the propensity score

Registry of 10000 patients. Before correction: mean age in drug A group = 52, in drug B = 68 (SMD = 0.85, huge imbalance). e(X) was estimated via XGBoost on 47 covariates. After matching on e(X): mean age A = 60, B = 61 (SMD = 0.03). Gender, history, lab values also balanced. Now outcome comparison is causally interpretable.

What does the balancing property of the propensity score state?

Rosenbaum-Rubin theorem: conditioning on the scalar e(X) is equivalent in balancing power to conditioning on the full vector X. This is the foundation of propensity score use - collapsing dimensionality from dozens of covariates to one without losing identification.

Matching: finding a close control for every treated unit

The matching idea is simple: for every treated patient, find a control with a close propensity score. If e(X_i) for a treated unit is 0.72 and for some control also 0.72, then by the balancing property their covariates are draws from the same distribution. A single pair difference is a noisy quantity, not an individual effect, but averaging the differences over all matched pairs yields an estimate of the ATT that is unbiased under ignorability. This is the causal analog of k-NN in propensity space.

| Matching method | Description | When to use |
| --- | --- | --- |
| 1:1 nearest neighbor | Each treated gets one nearest control | Small samples, easy interpretation |
| Caliper matching | Only pairs with \|e_i - e_j\| < caliper | Partial overlap, avoid bad matches |
| 1:k matching | Each treated gets k nearest controls | More precision, lower variance |
| Full matching | Variable group sizes, optimal | Best balance, harder to interpret |

Caliper is typically set to 0.2 * SD(logit(e(X))). If the nearest control is farther than the caliper, the treated unit is dropped. This shrinks the sample but protects against bad matches in regions of weak overlap.
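The caliper rule can be sketched as a greedy 1:1 nearest-neighbor match on the logit scale. Several choices here are assumptions rather than part of the lesson: matching is done without replacement, treated units are processed in input order, and the SD of logit(e) is computed over treated and controls pooled together. The function name `caliper_match` and the toy propensities are hypothetical.

```python
import math
import statistics

def logit(p):
    return math.log(p / (1 - p))

def caliper_match(e_treated, e_control, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement on logit(e).

    Returns (pairs, caliper); pairs holds (treated_index, control_index).
    """
    lt = [logit(p) for p in e_treated]
    lc = [logit(p) for p in e_control]
    # Caliper = 0.2 * SD of logit(e); pooling both groups is an assumption.
    caliper = caliper_sd * statistics.pstdev(lt + lc)
    available = list(range(len(lc)))
    pairs = []
    for i, li in enumerate(lt):
        if not available:
            break
        # Nearest still-unused control on the logit scale.
        j = min(available, key=lambda k: abs(lc[k] - li))
        if abs(lc[j] - li) <= caliper:
            pairs.append((i, j))
            available.remove(j)
        # Otherwise the treated unit is dropped: no control is close enough.
    return pairs, caliper

pairs, caliper = caliper_match([0.50, 0.70, 0.95], [0.52, 0.30, 0.69, 0.10])
print(pairs)  # [(0, 0), (1, 2)]; the treated unit with e = 0.95 is dropped
```

The linear scan is fine for a sketch; on large registries a sorted array or a nearest-neighbor index would replace the `min` call.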

Matching workflow on cholesterol data

Step by step from X to ATT

Step 1: estimate e(X) with XGBoost, obtain a vector of propensities.
Step 2: for each treated unit (drug A, n=3000) find the nearest control (drug B, n=7000) by logit e(X).
Step 3: apply caliper = 0.2 * SD(logit e(X)): about 5 percent of treated units are dropped.
Step 4: on the matched sample compute the SMD of every covariate: all are below 0.1.
Step 5: ATT = mean of (Y_treated - Y_control) over pairs = -22 mg/dL (a 22 mg/dL LDL reduction). Confidence interval via bootstrap.

If an important confounder is absent from X (such as motivation or physical activity that was never logged), ignorability given X fails: even a perfectly estimated e(X) balances only the observed covariates, and matching cannot fix the bias. The propensity score is a compression of observed X, not magic. Sensitivity analysis assesses how strong an unobserved confounder must be to wipe out the effect.

Airbnb compares host strategies (for example, instant booking vs request-based) on observational data. Thousands of listings, dozens of covariates - location, price, photos, reviews. Propensity score via GBM, matching via kd-tree or faiss for speed. Similar techniques are used by Yelp for evaluating premium listing effects, and by Uber for promo code effects. Matching in propensity space is k-NN with a learned embedding.

The ATT estimate over matched pairs is

ATT_hat = (1 / N_T) * sum over matched treated i of (Y_i - Y_m(i))

Here N_T is the number of matched treated units and m(i) is the index of the control matched to the i-th treated unit. The formula is simply the average within-pair difference. Confidence intervals are not computed by standard OLS formulas: pairs are not independent (one control may be reused), so bootstrap or specialized robust SE for matched data (Abadie, Imbens 2006) are used.
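The average within-pair difference and a percentile bootstrap interval can be sketched as follows. This is a simplified illustration: it resamples the already-matched pairs, so it ignores the uncertainty from estimating e(X) and from the matching step itself, and a fuller version would redo both inside each resample. Abadie and Imbens have also shown that the naive bootstrap can fail for nearest-neighbor matching with replacement, so the interval is a rough guide only. Function names and outcome values are hypothetical.

```python
import random
import statistics

def att_from_pairs(y_treated, y_control, pairs):
    """Average of Y_i - Y_m(i) over matched (treated, control) index pairs."""
    return statistics.fmean(y_treated[i] - y_control[j] for i, j in pairs)

def bootstrap_ci(y_treated, y_control, pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over matched pairs (assumption: pairs are the resampling unit)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled = [rng.choice(pairs) for _ in pairs]
        estimates.append(att_from_pairs(y_treated, y_control, resampled))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy LDL outcomes (mg/dL) for three matched pairs.
y_treated = [100.0, 110.0, 120.0]
y_control = [125.0, 130.0, 150.0, 140.0]
pairs = [(0, 0), (1, 2), (2, 3)]

att = att_from_pairs(y_treated, y_control, pairs)
lo, hi = bootstrap_ci(y_treated, y_control, pairs)
print(f"ATT = {att:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```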

What is the purpose of a caliper in matching?

Caliper is a threshold on the maximum allowed e(X) gap between treated and control in a pair. If the nearest control is too far, no pair is formed and the treated unit is dropped. Protection against bad matches in regions of broken overlap.

IPTW: reweighting instead of matching

An alternative to matching is inverse probability of treatment weighting (IPTW). Idea: reweight observations to produce a 'pseudo-population' where treatment is distributed as in an RCT, independent of X. Treated units with small e(X) (unlikely to be treated, yet treated) get larger weights, and so do controls with large e(X). The IPTW ATE formula is elementary, and unlike matching it does not require pair construction:

ATE_hat = (1/N) * sum_i [ T_i * Y_i / e(X_i) - (1 - T_i) * Y_i / (1 - e(X_i)) ]

Intuition: the first term recovers E[Y(1)], the mean potential outcome under treatment, by reweighting treated units. The second term recovers E[Y(0)]. Every treated unit 'represents' all similar units in the population, and the weight 1/e(X_i) reflects how many it represents.
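A small numeric sketch of the IPTW estimate follows. The ten-patient dataset is invented so that the true effect is -20 mg/dL in both strata and the propensities (0.8 and 0.2) are known exactly; none of it comes from the lesson's registry. The normalized (Hajek) variant, which divides by the sum of weights instead of N, is included as a commonly used alternative and is an addition to the text above.

```python
def iptw_ate(y, t, e):
    """Return (Horvitz-Thompson ATE, normalized Hajek ATE) from outcomes, treatment, propensities."""
    n = len(y)
    treated_sum = sum(yi / ei for yi, ti, ei in zip(y, t, e) if ti == 1)
    control_sum = sum(yi / (1 - ei) for yi, ti, ei in zip(y, t, e) if ti == 0)
    ht = (treated_sum - control_sum) / n
    w1 = sum(1 / ei for ti, ei in zip(t, e) if ti == 1)        # total treated weight
    w0 = sum(1 / (1 - ei) for ti, ei in zip(t, e) if ti == 0)  # total control weight
    hajek = treated_sum / w1 - control_sum / w0
    return ht, hajek

# Stratum 1 (e = 0.8): 4 treated with LDL 130, 1 control with LDL 150.
# Stratum 2 (e = 0.2): 1 treated with LDL 160, 4 controls with LDL 180.
y = [130.0] * 4 + [150.0] + [160.0] + [180.0] * 4
t = [1] * 4 + [0] + [1] + [0] * 4
e = [0.8] * 5 + [0.2] * 5

naive = (sum(yi for yi, ti in zip(y, t) if ti == 1) / sum(t)
         - sum(yi for yi, ti in zip(y, t) if ti == 0) / (len(t) - sum(t)))
ht, hajek = iptw_ate(y, t, e)
print(f"naive: {naive:.1f}, IPTW: {ht:.1f}, normalized IPTW: {hajek:.1f}")
# naive: -38.0, IPTW: -20.0, normalized IPTW: -20.0
```

The naive difference mixes the treatment effect with the difference between strata; the reweighting removes that part.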

If e(X) approaches 0 or 1, weights explode and ATE estimates become extremely unstable. This is positivity (overlap) violation. Symptoms: a few observations have weights tens of times higher than the rest, ATE swings wildly when adding one observation. Fixes: trimming (drop units with e(X) outside [0.05, 0.95]), stabilized weights, or different methods (DR, see the next concept).
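One way to spot exploding weights is Kish's effective sample size, (sum w)^2 / sum w^2, which drops sharply when a few units dominate. The sketch below computes it before and after trimming to [0.05, 0.95]. The ESS diagnostic is an addition to the text, and the four-unit example is made up for illustration.

```python
def iptw_weights(t, e):
    """Weight 1/e for treated units, 1/(1-e) for controls."""
    return [1 / ei if ti == 1 else 1 / (1 - ei) for ti, ei in zip(t, e)]

def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def trim(t, e, lo=0.05, hi=0.95):
    """Keep only units whose propensity lies inside [lo, hi]."""
    kept = [(ti, ei) for ti, ei in zip(t, e) if lo <= ei <= hi]
    return [ti for ti, _ in kept], [ei for _, ei in kept]

t = [1, 1, 0, 0]
e = [0.5, 0.01, 0.5, 0.5]   # the second unit was treated despite e = 0.01

ess_before = effective_sample_size(iptw_weights(t, e))
t_trim, e_trim = trim(t, e)
ess_after = effective_sample_size(iptw_weights(t_trim, e_trim))
print(f"ESS before trimming: {ess_before:.2f} of {len(t)} units")
print(f"ESS after trimming:  {ess_after:.2f} of {len(t_trim)} units")
```

In a real analysis the outcomes and covariates of trimmed units are dropped as well, and the estimand changes to the effect in the region of overlap.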

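Stabilized weights of Robins, Hernan, Brumback (2000): w_i = T_i * P(T=1) / e(X_i) + (1-T_i) * P(T=0) / (1-e(X_i)). The numerator, the marginal probability P(T=1) or P(T=0), rescales the weights so that they average about 1 and the pseudo-population keeps the original sample size. When weighted means are normalized within each arm, the point estimate is unchanged because the constant factor cancels; the gain appears in weighted outcome models (marginal structural models), where stabilized weights give narrower confidence intervals. Standard in epidemiology.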
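The stabilized-weight formula can be checked numerically. The sketch uses an invented ten-patient dataset with known propensities 0.8 and 0.2 and a built-in effect of -20 mg/dL; the marginal P(T=1) is estimated as the sample share of treated units, which is an assumption of this example.

```python
def stabilized_weights(t, e):
    """w_i = P(T=1)/e_i for treated units, P(T=0)/(1-e_i) for controls."""
    p1 = sum(t) / len(t)  # marginal P(T=1), estimated from the sample
    return [p1 / ei if ti == 1 else (1 - p1) / (1 - ei) for ti, ei in zip(t, e)]

def weighted_mean_difference(y, t, w):
    """Difference of weight-normalized outcome means, treated minus control."""
    num1 = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == 1)
    den1 = sum(wi for ti, wi in zip(t, w) if ti == 1)
    num0 = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == 0)
    den0 = sum(wi for ti, wi in zip(t, w) if ti == 0)
    return num1 / den1 - num0 / den0

y = [130.0] * 4 + [150.0] + [160.0] + [180.0] * 4
t = [1] * 4 + [0] + [1] + [0] * 4
e = [0.8] * 5 + [0.2] * 5

plain = [1 / ei if ti == 1 else 1 / (1 - ei) for ti, ei in zip(t, e)]
stable = stabilized_weights(t, e)

print(f"sum of plain weights:      {sum(plain):.1f} (max {max(plain):.2f})")
print(f"sum of stabilized weights: {sum(stable):.1f} (max {max(stable):.2f})")
print(f"estimate, plain:      {weighted_mean_difference(y, t, plain):.1f}")
print(f"estimate, stabilized: {weighted_mean_difference(y, t, stable):.1f}")
```

Here the stabilized weights sum to the sample size of 10 instead of 20, while the normalized estimate stays at -20 under both weightings.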

ML connection: unbiased offline evaluation for recommenders

Schnabel et al. 2016: IPTW against missing-not-at-random

Netflix wants to evaluate a new recommender, but logs are biased: a user only sees and rates items selected by the old algorithm. This is missing-not-at-random data. Schnabel et al. proposed: for each (user, item) pair estimate the propensity (the probability that the rating is observed) with a fitted model; the paper uses naive Bayes and logistic regression. Weighting each observed entry by 1/propensity gives an unbiased estimate of the new model's performance even on biased logs. The same idea underlies IPS-based debiasing of logged bandit feedback by Swaminathan and Joachims.

Any production system creates biased logs: it shows what it considers relevant and collects data only on what was shown. Same structure as selection on observables. IPTW with propensity = probability of showing item gives an unbiased estimate of new-model metrics. YouTube, Spotify, Amazon apply these techniques for debiasing offline evaluation - to avoid launching every idea to production just to measure it.
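The unbiasedness claim can be verified exactly on a tiny example by enumerating every possible observation pattern. The sketch assumes four (user, item) pairs with made-up prediction losses and made-up observation propensities, each pair observed independently; the IPS estimate divides by the total number of pairs, following the form used by Schnabel et al.

```python
from itertools import product

losses = [0.0, 1.0, 4.0, 1.0]         # hypothetical loss of the new model on each pair
propensities = [0.8, 0.5, 0.1, 0.4]   # hypothetical P(pair is observed in the logs)

def ips_estimate(observed, losses, propensities):
    """Sum of loss / propensity over observed pairs, divided by the number of ALL pairs."""
    total = sum(l / p for o, l, p in zip(observed, losses, propensities) if o)
    return total / len(losses)

def naive_estimate(observed, losses):
    """Plain average over observed pairs only."""
    seen = [l for o, l in zip(observed, losses) if o]
    return sum(seen) / len(seen)

true_loss = sum(losses) / len(losses)

# Expected value of the IPS estimate over all 2^4 observation patterns.
expected_ips = 0.0
for pattern in product([0, 1], repeat=len(losses)):
    prob = 1.0
    for o, p in zip(pattern, propensities):
        prob *= p if o else 1 - p
    expected_ips += prob * ips_estimate(pattern, losses, propensities)

one_log = (1, 1, 0, 1)  # a typical log: the high-loss pair was never shown
print(f"true loss:             {true_loss:.3f}")
print(f"expected IPS estimate: {expected_ips:.3f}")
print(f"naive on one log:      {naive_estimate(one_log, losses):.3f}")
```

A single log can still give an IPS value far from the truth, especially when a propensity is as low as 0.1; unbiasedness is a statement about the average over logs.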

What happens to the IPTW estimate when e(X) is near 0 for some units?

When e(X) is near 0, a single rare treated unit gets a huge weight and dominates the estimate. This is a symptom of positivity (overlap) violation. Fixes: trimming, stabilized weights, doubly robust estimation.

Doubly robust and Double ML: insurance against model errors

Matching and IPTW require a correct propensity model. Outcome regression (estimating E[Y|T, X]) requires a correct outcome model. Each model can be wrong. A doubly robust (DR) estimator combines both: it is consistent if at least ONE of the two models is correct. This dramatic property - double insurance against misspecification - made DR the standard in modern causal ML.

The structure of AIPTW (Augmented IPTW) is transparent:

ATE_hat = (1/N) * sum_i [ mu_1(X_i) - mu_0(X_i) + T_i * (Y_i - mu_1(X_i)) / e(X_i) - (1 - T_i) * (Y_i - mu_0(X_i)) / (1 - e(X_i)) ]

where mu_t(X) = E[Y | T=t, X]. The first part is the outcome regression estimate (mu_1 - mu_0). The second part is the IPW correction of residuals. If the outcome model is correct, residuals are mean-zero and the estimate reduces to outcome regression. If the propensity model is correct, the IPW part removes the bias of a wrong outcome model. If both are correct, the estimator is efficient: it attains the semiparametric efficiency bound.
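Double robustness can be seen on a toy dataset by deliberately breaking one model at a time. In this sketch the ten-patient data, the 'wrong' outcome model (all zeros), and the 'wrong' propensity (0.5 for everyone) are invented for illustration; the true effect built into the data is -20 mg/dL.

```python
def aiptw_ate(y, t, e, mu1, mu0):
    """AIPTW: outcome-regression difference plus IPW-weighted residual corrections."""
    scores = [
        m1 - m0 + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
        for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0)
    ]
    return sum(scores) / len(scores)

# Stratum 1: 4 treated (LDL 130), 1 control (LDL 150); true e = 0.8.
# Stratum 2: 1 treated (LDL 160), 4 controls (LDL 180); true e = 0.2.
y = [130.0] * 4 + [150.0] + [160.0] + [180.0] * 4
t = [1] * 4 + [0] + [1] + [0] * 4

e_right = [0.8] * 5 + [0.2] * 5
e_wrong = [0.5] * 10                      # ignores the covariate entirely
mu1_right = [130.0] * 5 + [160.0] * 5
mu0_right = [150.0] * 5 + [180.0] * 5
mu_wrong = [0.0] * 10                     # a useless outcome model

only_e_right = aiptw_ate(y, t, e_right, mu_wrong, mu_wrong)
only_mu_right = aiptw_ate(y, t, e_wrong, mu1_right, mu0_right)
both_wrong = aiptw_ate(y, t, e_wrong, mu_wrong, mu_wrong)

print(f"propensity right, outcome wrong: {only_e_right:.1f}")
print(f"outcome right, propensity wrong: {only_mu_right:.1f}")
print(f"both wrong:                      {both_wrong:.1f}")
# -20.0, -20.0, -38.0
```

With both models wrong the estimate falls back to the naive difference of -38, which is the limit of the insurance.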

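Double Machine Learning (DML), Chernozhukov et al. (2018): the theoretical foundation for using arbitrary ML models to estimate nuisance functions (mu and e) in causal inference. Key ingredients: (1) an orthogonal moment condition, which makes the estimate insensitive to small errors in the nuisance functions; (2) cross-fitting, where nuisance functions are fit on one part of the data and evaluated on another. Under these conditions the ATE estimate is sqrt(N)-consistent and asymptotically normal even when mu and e individually converge slower than N^{-1/2}, provided the product of their error rates shrinks faster than N^{-1/2} (for example, each faster than N^{-1/4}).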
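The cross-fitting recipe can be sketched with the AIPTW score and two folds. Everything concrete here is an assumption for illustration: the nuisance 'learners' are simple group means on a single discrete covariate (in practice they would be ML models), folds are formed by taking every k-th row (real data should be shuffled first), and the noise-free toy data make the result exact. This is not the API of any DML library.

```python
import math
import statistics
from collections import defaultdict

def fit_propensity(train):
    """Toy learner: share of treated units within each value of x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [n_treated, n_total]
    for x, t, _ in train:
        counts[x][0] += t
        counts[x][1] += 1
    return lambda x: counts[x][0] / counts[x][1]

def fit_outcome(train):
    """Toy learner: mean outcome within each (x, t) cell."""
    cells = defaultdict(list)
    for x, t, y in train:
        cells[(x, t)].append(y)
    return lambda x, t: statistics.fmean(cells[(x, t)])

def cross_fit_aiptw(rows, n_folds=2):
    """Fit nuisances on the other folds, score the held-out fold, average the scores."""
    folds = [rows[k::n_folds] for k in range(n_folds)]
    scores = []
    for k, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != k for r in fold]
        e_hat, mu_hat = fit_propensity(train), fit_outcome(train)
        for x, t, y in held_out:
            e, m1, m0 = e_hat(x), mu_hat(x, 1), mu_hat(x, 0)
            scores.append(m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e))
    ate = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return ate, se

# Rows are (x, t, y); each (x, t) cell appears in both folds.
rows = ([("A", 1, 130.0)] * 4 + [("A", 0, 150.0)] * 2
        + [("B", 1, 160.0)] * 2 + [("B", 0, 180.0)] * 4)
ate, se = cross_fit_aiptw(rows)
print(f"cross-fitted ATE = {ate:.1f}, SE = {se:.3f}")
```

The key point is that no observation is scored with nuisance functions that were trained on it; the standard error comes directly from the spread of the per-unit scores.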

Uber has openly described its causal platform (CausalML library): DML, X-learner, R-learner for uplift estimation under promo codes, pricing, product features. Nuisance functions are estimated by gradient boosting. Cross-fitting allows using LightGBM with thousands of features without the overfitting bias that arises when nuisance functions and the effect are estimated on the same observations. Similar tools: EconML from Microsoft and DoubleML in R/Python.

Double robustness is insurance against wrong functional form of mu and e, not against omitted variables. If an important confounder is unobserved, neither mu nor e is correct - and DR is also biased. Also: DR loses double robustness in small samples where both models are poorly estimated.

Comparing methods on the same data

Which estimator to pick

Cholesterol study, 10000 patients, 47 covariates. Naive difference: -14 mg/dL (biased, no age correction). Outcome regression (linear): -19 mg/dL. IPTW with logistic propensity: -18 mg/dL. IPTW with GBM propensity: -21 mg/dL. AIPTW (linear outcome + GBM propensity): -22 mg/dL. DML with XGBoost for both nuisance functions: -22 mg/dL with CI [-24, -20]. DML matches the AIPTW point estimate, and its standard errors remain valid with ML nuisance models thanks to cross-fitting and orthogonality.

What does 'doubly robust' mean for AIPTW estimators?

Doubly robust means misspecification of one nuisance function does not kill consistency: the other compensates. Mathematically this follows from the AIPTW moment condition, whose bias is a product of the errors in mu and e, so it vanishes when either error is zero. If BOTH are wrong, DR is biased too. If at least one is correct, DR is consistent.

Where propensity score leads

Propensity score is the first practical technique for applying the Rubin model. The next methods are for situations where ignorability fails or additional sources of variation are needed.

Confounders and Simpson's Paradox — Propensity score methods directly attack the confounder problem - balance covariates to simulate randomization
Randomized Controlled Trials — Propensity methods approximate the statistical balance that randomization achieves mechanically

Key ideas

Propensity score e(X) = P(T=1|X) compresses the entire covariate vector to one number
Balancing property: X ⊥ T | e(X) - conditioning on the scalar is equivalent to conditioning on the full vector X
Matching: for each treated unit find the nearest control in propensity space
IPTW: weights 1/e(X) and 1/(1-e(X)) give an unbiased ATE when ignorability holds and e(X) is correct
Positivity violation (e(X) -> 0 or 1) kills IPTW via extreme weights
Doubly robust combines outcome regression and IPTW, consistent under correctness of either model
DML with cross-fitting allows arbitrary ML models for nuisance functions

Questions for reflection

Why is a high AUC of the propensity model a bad sign, not a good one?
What is the principled difference between matching and IPTW given the same e(X)?
In which situations does the DR estimator fail to rescue from bias, despite double robustness?
Why can ML not be used for e(X) in high-dimensional X without cross-fitting?

Related lessons

stat-41-causal-potential-outcomes — Propensity score relies on ignorability
stat-20-causal — Baseline understanding of confounding
stat-38-logistic-regression — Logistic regression is the classical model for e(X)
stat-46-causal-sensitivity — Sensitivity analysis for ignorability violations
ml-10-logistic-regression