Causal Calculus

Causal Inference with Text and NLP

Clinical notes can act as confounders: physicians write more detailed records for sicker patients, so note length correlates with poor outcomes. If the text is not controlled for, treatment effect estimates are biased. Causal NLP addresses this by using language models to adjust for text-induced confounding.

Healthcare: clinical notes as a confounder when estimating treatment effects
Social science: news sentiment as a confounder in market behavior studies
Advertising: ad copy as a confounder when estimating click-through rates
Legal research: language of court rulings and racial bias
Algorithm auditing: discrimination through proxy features embedded in text

Lesson objectives

Use BERT embeddings as a proxy for text confounders in causal analysis
Apply the doubly-robust DR-Learner to estimate CATE
Interpret the E-value as a measure of robustness to unmeasured confounders

Prerequisites

Causal identification and the backdoor criterion
Propensity score estimation and IPW
Transformers and BERT: contextual embeddings

Text as a confounder

When unstructured text $T$ (clinical notes, product descriptions) influences both treatment $D$ and outcome $Y$, it is a confounder. Controlling for $T$ via bag-of-words discards word order and context, so semantics are lost. BERT embeddings $\phi(T)$ serve as a richer proxy for $T$, so that standard adjustment methods such as IPW and doubly robust estimation can be applied.
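As a small illustration of why bag-of-words loses semantics, the sketch below compares two hypothetical clinical notes (invented for this example) that use the same words in a different order. The `bag_of_words` helper is a deliberately naive tokenizer, not a production one.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Naive bag-of-words: lowercase, drop commas, count whitespace tokens."""
    return Counter(text.lower().replace(",", " ").split())


# Two hypothetical notes with opposite clinical meaning.
note_a = "severe fever, no pain"
note_b = "no fever, severe pain"

# Word order is discarded, so both notes map to the same representation.
same_representation = bag_of_words(note_a) == bag_of_words(note_b)
print(same_representation)  # True
```

A contextual encoder processes tokens in context, so it can in principle assign these two notes different vectors; whether it separates them along the dimension that matters for confounding is an empirical question.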

E-value: robustness to unmeasured confounders

E-value (VanderWeele and Ding, 2017): the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both treatment and outcome to explain away the observed association. For an observed relative risk $RR \ge 1$, $E = RR + \sqrt{RR(RR-1)}$. A large E-value signals robustness: the confounder would need to be very strong to explain the result.
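The formula can be checked with a few lines of Python. This is a minimal sketch for a point estimate only; the function name `e_value` is chosen here for illustration, and the inversion of protective effects ($RR < 1$) before applying the formula is the usual convention rather than something stated in the text above.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a point estimate on the risk-ratio scale."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    # Convention (assumption beyond the lesson text): for protective
    # effects, RR < 1, invert first so the formula applies to RR >= 1.
    rr = max(rr, 1.0 / rr)
    return rr + math.sqrt(rr * (rr - 1.0))


print(round(e_value(2.0), 3))  # 3.414
print(round(e_value(0.5), 3))  # 3.414, same as RR = 2 after inversion
print(e_value(1.0))            # 1.0, a null effect needs no confounder
```

For $RR = 2$ the result is $2 + \sqrt{2} \approx 3.41$: a hidden confounder would need risk-ratio associations of about 3.4 with both treatment and outcome to reduce the observed effect to the null.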

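Controlling for text confounders via neural embeddings works only under the sufficiency assumption: $T \perp\!\!\!\perp (D, Y) \mid \phi(T)$, meaning the embedding retains everything in the text that is relevant to treatment and outcome. This is untestable; how well it holds depends on the expressiveness of the embeddings.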
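The toy example below sketches what happens when the sufficiency assumption fails. All data are hypothetical: a latent severity level with three values drives both treatment and outcome, the true effect is 1 by construction, and a "coarse" proxy merges the two highest severity levels. Stratification stands in for adjustment on $\phi(T)$; it is used here only because it keeps the arithmetic transparent.

```python
from collections import defaultdict


def stratified_ate(d, y, strata):
    """Size-weighted average of within-stratum treated-minus-control means.

    Assumes every stratum contains both treated and control units.
    """
    groups = defaultdict(lambda: {0: [], 1: []})
    for di, yi, si in zip(d, y, strata):
        groups[si][di].append(yi)
    total = 0.0
    for arms in groups.values():
        treated, control = arms[1], arms[0]
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        total += (len(treated) + len(control)) * diff
    return total / len(y)


# Hypothetical counts: severity level -> (number treated, number control).
counts = {0: (1, 3), 1: (2, 2), 2: (3, 1)}
s, d = [], []
for level, (n_treated, n_control) in counts.items():
    s += [level] * (n_treated + n_control)
    d += [1] * n_treated + [0] * n_control
y = [di + 2 * si for di, si in zip(d, s)]  # true treatment effect is 1

naive = stratified_ate(d, y, [0] * len(y))                 # no adjustment
coarse = stratified_ate(d, y, [min(si, 1) for si in s])    # insufficient proxy
fine = stratified_ate(d, y, s)                             # sufficient proxy

print(round(naive, 3))   # 2.333
print(round(coarse, 3))  # 1.356
print(round(fine, 3))    # 1.0
```

The coarse proxy removes part of the bias but not all of it, which is the situation to expect when an embedding only partially captures the confounder.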

Text as Confounder Proxy

Roberts et al. (2020) used topic models on drug reviews to control for 'patient health' (an unobserved confounder) when estimating drug side effects. According to Veitch et al. (2020), using BERT embeddings as propensity model features reduces bias by 40-60% compared to naive OLS when the text partially observes the confounder.

Why is the doubly-robust estimator preferred over simple IPW?

E-value and Sensitivity to Hidden Confounding

VanderWeele and Ding (2017) introduced the E-value: the minimum strength of an unobserved confounder (measured on the risk-ratio scale) needed to explain away the observed effect, $E = RR + \sqrt{RR(RR-1)}$. For text data, the analogous check is to analyze how sensitive the estimate is to words or topics missing from the representation. A large E-value indicates that the result is hard to explain by hidden confounding.

What does the E-value (VanderWeele & Ding) measure?

BERT propensity score for text confounding

Propensity model: $e(T) = P(D=1 \mid T) = \sigma(w^\top \phi_{BERT}(T))$. IPW estimate: $\hat{\tau}_{IPW} = \frac{1}{n}\sum_i \left[ \frac{D_i Y_i}{e(T_i)} - \frac{(1-D_i)Y_i}{1-e(T_i)} \right]$. The doubly-robust estimator combines the propensity model with an outcome model and remains consistent when either of the two is correctly specified.
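The sketch below illustrates the double-robustness property on a small hypothetical dataset. It assumes the standard augmented IPW (AIPW) form of the doubly robust estimator, which the text does not spell out: $\hat{\tau}_{DR} = \frac{1}{n}\sum_i \left[ \hat{\mu}_1(T_i) - \hat{\mu}_0(T_i) + \frac{D_i (Y_i - \hat{\mu}_1(T_i))}{e(T_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(T_i))}{1-e(T_i)} \right]$. The binary feature `x` is a stand-in for $\phi(T)$, and the propensities and outcome predictions are supplied by hand rather than fitted, so that each model can be made deliberately right or wrong.

```python
def ipw_ate(d, y, e):
    """IPW estimate of the average treatment effect given propensities e."""
    terms = [di * yi / ei - (1 - di) * yi / (1 - ei)
             for di, yi, ei in zip(d, y, e)]
    return sum(terms) / len(terms)


def aipw_ate(d, y, e, mu1, mu0):
    """AIPW estimate: outcome-model contrast plus inverse-weighted residuals."""
    terms = [(m1 - m0) + di * (yi - m1) / ei - (1 - di) * (yi - m0) / (1 - ei)
             for di, yi, ei, m1, m0 in zip(d, y, e, mu1, mu0)]
    return sum(terms) / len(terms)


# Hypothetical data: x = 1 marks a long note (sicker patient).
x = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
y = [di + 2 * xi for di, xi in zip(d, x)]  # true effect is 1 by construction

e_true = [0.75 if xi else 0.25 for xi in x]  # matches treated shares per group
e_wrong = [0.5] * len(x)                     # ignores the text entirely
mu1_true = [1 + 2 * xi for xi in x]          # correct outcome model, treated
mu0_true = [2 * xi for xi in x]              # correct outcome model, control
mu_wrong = [0.0] * len(x)                    # useless outcome model

print(ipw_ate(d, y, e_true))                        # 1.0
print(ipw_ate(d, y, e_wrong))                       # 2.0
print(aipw_ate(d, y, e_wrong, mu1_true, mu0_true))  # 1.0
print(aipw_ate(d, y, e_true, mu_wrong, mu_wrong))   # 1.0
print(aipw_ate(d, y, e_wrong, mu_wrong, mu_wrong))  # 2.0
```

IPW recovers the true effect only with the correct propensities, while AIPW recovers it when either the propensity model or the outcome model is correct, and fails only when both are wrong. In practice both nuisance models would be fitted from data, for example on BERT embeddings, which this sketch does not attempt.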

Summary

Text as a confounder: BERT embeddings $\phi(T)$ are used as a proxy for adjustment in IPW and DR estimators
The doubly robust (DR) estimator is consistent when at least one of the two models (propensity or outcome) is correctly specified
E-value measures the minimum confounder strength to explain away the result: higher E-value means greater robustness

Connections to other topics

Causal NLP bridges neural language models and Rubin's potential outcomes. Related methods: synthetic control with text data, causal inference for recommendation systems, debiasing language models through causal auditing.

Questions for reflection

When are BERT embeddings insufficient as a proxy for a confounder? Give an example where semantics fail to capture the relevant confounding variable.
Double robustness: if both models (propensity and outcome) are misspecified, the DR estimator is generally inconsistent. How would you check whether each model is adequately specified?
An E-value of 2.5 means that an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least 2.5 each, beyond the measured covariates, to explain away the result. How would you interpret this in a clinical study?