Causal Calculus
Causal Inference with Text and NLP
Clinical notes can act as confounders: physicians write more detailed records for sicker patients, so 'note length' correlates with poor outcomes. If the text is not controlled for, treatment effect estimates can be biased. Causal NLP addresses this by using language-model representations of the text to adjust for text-borne confounding.
- Healthcare: clinical notes as a confounder when estimating treatment effects
- Social science: news sentiment as a confounder in market behavior studies
- Advertising: ad copy as a confounder when estimating click-through rates
- Legal research: language of court rulings and racial bias
- Algorithm auditing: discrimination through proxy features embedded in text
Lesson objectives
- Use BERT embeddings as a proxy for text confounders in causal analysis
- Apply the doubly-robust DR-Learner to estimate CATE
- Interpret the E-value as a measure of robustness to unmeasured confounders
Prerequisites
- Causal identification and the backdoor criterion
- Propensity score estimation and IPW
- Transformers and BERT: contextual embeddings
Text as a confounder
When unstructured text $T$ (clinical notes, product descriptions) influences both treatment $D$ and outcome $Y$, it is a confounder. Controlling for $T$ via bag-of-words loses semantics. BERT embeddings $\phi(T)$ serve as a rich proxy for $T$, enabling standard causal methods.
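The effect of ignoring versus adjusting for $\phi(T)$ can be illustrated with a small simulation. This is only a sketch under strong assumptions: `phi` is a random Gaussian matrix standing in for real BERT embeddings, the confounder `severity` is a linear function of `phi`, and all coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20_000, 8

# Hypothetical stand-in for phi(T); in practice these would be BERT embeddings.
phi = rng.normal(size=(n, k))

# Assumed confounder encoded in the text (e.g. illness severity), variance 1.
severity = phi.sum(axis=1) / np.sqrt(k)

# Sicker patients are more likely to be treated and have higher (worse) outcomes.
p_treat = 1.0 / (1.0 + np.exp(-1.5 * severity))
d = rng.binomial(1, p_treat)
true_effect = 1.0
y = true_effect * d + 2.0 * severity + rng.normal(size=n)

# Naive contrast: difference in mean outcomes, ignoring the text.
naive = y[d == 1].mean() - y[d == 0].mean()

# Regression adjustment: least squares of Y on [1, D, phi(T)].
X = np.column_stack([np.ones(n), d, phi])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = coef[1]  # coefficient on D

print(f"naive: {naive:.2f}  adjusted: {adjusted:.2f}  truth: {true_effect:.2f}")
```

In this setup the naive contrast overstates the effect, while the adjusted coefficient lands close to the true value. That only works here because the simulated confounder is fully and linearly recoverable from `phi`, which real embeddings do not guarantee.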
E-value: robustness to unmeasured confounders
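E-value (VanderWeele and Ding, 2017): the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both treatment and outcome, conditional on the measured covariates, to fully explain away the observed association. For a relative risk $RR > 1$, $E = RR + \sqrt{RR(RR-1)}$; for $RR < 1$, apply the formula to $1/RR$. A large E-value signals robustness: the confounder would need to be very strong to explain the result.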
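As a quick worked example, the formula can be written as a small helper. The function name `e_value` is hypothetical, and the sketch covers only the point estimate on the risk-ratio scale, not confidence limits.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a point estimate on the risk-ratio scale."""
    if rr <= 0:
        raise ValueError("relative risk must be positive")
    # For protective effects (RR < 1), the formula is applied to 1/RR.
    if rr < 1:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


# RR = 2 gives 2 + sqrt(2), roughly 3.41.
print(round(e_value(2.0), 2))
```

An observed $RR = 2$ therefore corresponds to an E-value of about 3.41, and $RR = 1$ (no association) gives the minimum possible E-value of 1.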
Controlling for text confounders via neural embeddings works only under the sufficiency assumption: $T \perp\!\!\!\perp (D, Y) | \phi(T)$. The assumption cannot be tested from the data, and how nearly it holds depends on how much of the confounding information the embedding preserves.
Text as Confounder Proxy
Roberts et al. (2020) used topic models on drug reviews to control for 'patient health' (an unobserved confounder) when estimating drug side effects. Veitch et al. (2020): BERT embeddings as propensity model features reduce bias by 40-60% compared to naive OLS when text partially observes the confounder.
Why is the doubly-robust estimator preferred over simple IPW?
E-value and Sensitivity to Hidden Confounding
VanderWeele and Ding (2017) introduced the E-value: the minimum strength of an unobserved confounder (measured on the risk-ratio scale) needed to explain away the observed effect. For $RR > 1$, $E = RR + \sqrt{RR(RR-1)}$. For text data, the same logic applies to confounding information that the text or its embedding fails to capture, such as omitted words or topics. A large E-value indicates the result is hard to explain by hidden confounding.
What does the E-value (VanderWeele & Ding) measure?
BERT propensity score for text confounding
Propensity model: $e(T) = P(D=1 \mid T) = \sigma(w^\top \phi_{BERT}(T))$. IPW estimate: $\hat{\tau}_{IPW} = \frac{1}{n}\sum_i \left[\frac{D_i Y_i}{e(T_i)} - \frac{(1-D_i)Y_i}{1-e(T_i)}\right]$. IPW relies entirely on the propensity model; the doubly-robust estimator adds an outcome model and remains consistent when either the outcome model or the propensity model is correctly specified.
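The two estimators can be sketched in a few lines of plain Python. The version of the doubly-robust estimator used here is the standard augmented IPW (AIPW) form; the function names, the toy data, and the deliberately wrong propensities `e_wrong` are all assumptions made for illustration. In practice `e`, `mu0`, and `mu1` would come from models fitted on $\phi_{BERT}(T)$.

```python
def ipw_ate(d, y, e):
    """IPW estimate of the average treatment effect."""
    n = len(y)
    return sum(di * yi / ei - (1 - di) * yi / (1 - ei)
               for di, yi, ei in zip(d, y, e)) / n


def aipw_ate(d, y, e, mu0, mu1):
    """Augmented IPW (doubly-robust) estimate of the average treatment effect.

    mu0 and mu1 are outcome-model predictions under control and treatment.
    """
    n = len(y)
    total = 0.0
    for di, yi, ei, m0, m1 in zip(d, y, e, mu0, mu1):
        # Outcome-model contrast plus inverse-weighted residual corrections.
        total += (m1 - m0) + di * (yi - m1) / ei - (1 - di) * (yi - m0) / (1 - ei)
    return total / n


# Hypothetical toy data: noiseless outcomes with a true effect of 2.0.
d = [1, 0, 1, 0]
mu0 = [1.0, 1.0, 3.0, 3.0]   # correct outcome model under control
mu1 = [3.0, 3.0, 5.0, 5.0]   # correct outcome model under treatment
y = [3.0, 1.0, 5.0, 3.0]
e_wrong = [0.9, 0.9, 0.2, 0.2]  # deliberately misspecified propensities

ipw_estimate = ipw_ate(d, y, e_wrong)
aipw_estimate = aipw_ate(d, y, e_wrong, mu0, mu1)
print(ipw_estimate, aipw_estimate)
```

With the wrong propensities, the IPW estimate drifts away from 2.0, while the AIPW estimate recovers it because the outcome model is correct and its residuals vanish. The toy data are noiseless to make this exact; with real data the guarantee is asymptotic.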
Summary
- Text as confounder: BERT embeddings $\phi(T)$ are used as a proxy for adjustment in IPW and DR estimators
- The doubly-robust (DR) estimator is consistent when at least one of the two models (propensity or outcome) is correctly specified
- E-value measures the minimum confounder strength to explain away the result: higher E-value means greater robustness
Connections to other topics
Causal NLP bridges neural language models and Rubin's potential outcomes. Related methods: synthetic control with text data, causal inference for recommendation systems, debiasing language models through causal auditing.
Questions for reflection
- When are BERT embeddings insufficient as a proxy for a confounder? Give an example where semantics fail to capture the relevant confounding variable.
- Double robustness: if both models (propensity and outcome) are misspecified, the DR estimator is generally inconsistent. How would you check whether each model is adequately specified?
- An E-value of 2.5 means that an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least 2.5 each, beyond the measured covariates, to fully explain away the result. How would you interpret this in a clinical study?