Causal Calculus
Transportability and Selection Diagrams
2011. A statin study showed a 30% reduction in heart attack mortality in the USA. Kenya's Ministry of Health asks: does this apply to us? Demographics differ, diet differs, baseline cholesterol differs. Before Bareinboim-Pearl the only answer was 'probably'. After 2011 - a precise algorithm: if the selection diagram allows eliminating S-variables via do-calculus, transfer is identifiable. If not - provably impossible.
- **Clinical trials:** the FDA requires studying how data from one country generalizes to another - transportability provides a formal tool instead of expert opinion
- **Federated ML in medicine:** hospitals cannot share patient data (GDPR), but can exchange summary statistics - the mSBD criterion determines whether that is sufficient for causal inference
- **Algorithmic fairness:** transferring a decision from one demographic group to another is a special case of transportability, where S-nodes correspond to group attributes
Prerequisites
External Validity
2011. Bareinboim and Pearl formalized a question that had haunted clinical epidemiology for decades: a hypertension drug trial was conducted in US hospitals. Can the results be applied in Kenya? Demographics differ (age, genetics), treatment protocols differ, even blood pressure measurement tools differ. Intuition says correction is needed. But how? And when is it fundamentally impossible?
External validity asks: can the causal effect P*(y|do(x)) be identified for a target population from source data P(.) plus the target distribution P*(.)? Note the word 'identified' - there exist configurations where transfer is mathematically impossible, and no amount of source data helps. Pearl and Bareinboim gave the exact criterion.
Three levels of the transfer problem:
1. S-admissibility - simple transfer through variables that differ
2. Transportability - the full algorithm via selection diagrams and do-calculus
3. Meta-transportability - multiple sources simultaneously
This lesson builds bottom-up.
A clinical RCT was conducted in the USA. Which statement correctly characterizes the transfer problem to Kenya?
Selection Diagrams
A selection diagram extends a DAG to two populations by adding S-nodes (selection nodes) - special variables with no parents in the diagram. An S-node points to variable V if the mechanism that generates V may differ between source and target. Source: S=0, target: S=1. Formally, S_i -> V_i means that P(V_i | pa(V_i)) and P*(V_i | pa(V_i)) are allowed to differ; the absence of an S-node into V_i asserts that the two are equal.
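The definition above can be held in a very small data structure. The following is a minimal sketch, not a library API: the class name `SelectionDiagram`, its fields, and the `S_<variable>` naming of selection nodes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SelectionDiagram:
    """A shared causal DAG plus the variables whose mechanisms may differ.

    Hypothetical container for illustration only.
    """
    edges: set                                     # directed (parent, child) pairs
    s_targets: set = field(default_factory=set)    # variables pointed to by an S-node

    def parents(self, v):
        """Parents of v in the shared DAG (S-nodes excluded)."""
        return {p for p, c in self.edges if c == v}

    def full_edges(self):
        """DAG edges plus one parentless S-node per possibly differing variable."""
        return self.edges | {(f"S_{v}", v) for v in self.s_targets}

    def shared_mechanisms(self):
        """Variables assumed to have the same mechanism in source and target."""
        nodes = {n for edge in self.edges for n in edge}
        return nodes - self.s_targets


# Age (Z) affects treatment (X) and outcome (Y); only the age distribution differs.
diagram = SelectionDiagram(
    edges={("Z", "X"), ("Z", "Y"), ("X", "Y")},
    s_targets={"Z"},
)
print(sorted(diagram.full_edges()))
print(sorted(diagram.shared_mechanisms()))
```

Storing only the set of S-targets, rather than explicit S-node objects, reflects the fact that S-nodes have no parents and exactly one child each.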
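Do-calculus in selection diagrams applies Pearl's three rules to the graph augmented with S-variables. The transportability criterion: P*(y|do(x)) is identifiable if and only if do-calculus can reduce it to an expression in which no term containing a do-operator also contains an S-variable. S may remain only in purely observational terms, which are read off the target distribution P*(.). If an S-node points directly into Y, no sequence of rules can remove it, and transfer is impossible without experiments in the target population.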
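One step of this elimination can be checked mechanically. Rule 1 of do-calculus lets S be dropped from P(y | do(x), z, s) when Y is d-separated from the S-nodes given Z and X in the diagram with all arrows into X removed; a set Z with this property is called S-admissible. The sketch below tests that condition with a hand-written d-separation check (moralized ancestral graph). The function names and the edge-list representation are assumptions for illustration, and this covers a single rule application, not the full transportability algorithm.

```python
from collections import deque


def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, a, b, given):
    """True if every node in `a` is d-separated from every node in `b` given `given`."""
    a, b, given = set(a), set(b), set(given)
    keep = ancestors(edges, a | b | given)
    sub = [(p, c) for p, c in edges if p in keep and c in keep]
    # Build the moral graph: undirected edges plus links between co-parents.
    nbrs = {v: set() for v in keep}
    parents = {}
    for p, c in sub:
        nbrs[p].add(c)
        nbrs[c].add(p)
        parents.setdefault(c, set()).add(p)
    for ps in parents.values():
        for u in ps:
            nbrs[u] |= ps - {u}
    # Breadth-first search from `a` that never enters a conditioned node.
    frontier = deque(a - given)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in b:
            return False
        for w in nbrs[v] - given - seen:
            seen.add(w)
            frontier.append(w)
    return True


def s_admissible(edges, s_nodes, x, y, z):
    """Check (Y _||_ S | Z, X) in the diagram with edges into X removed."""
    g_bar_x = {(p, c) for p, c in edges if c != x}
    return d_separated(g_bar_x, s_nodes, {y}, set(z) | {x})


# S points into the covariate Z only.
diagram_z = {("Z", "X"), ("Z", "Y"), ("X", "Y"), ("S_Z", "Z")}
# S points directly into the outcome Y.
diagram_y = {("Z", "X"), ("Z", "Y"), ("X", "Y"), ("S_Y", "Y")}

print(s_admissible(diagram_z, {"S_Z"}, "X", "Y", {"Z"}))   # adjusting for Z
print(s_admissible(diagram_z, {"S_Z"}, "X", "Y", set()))   # no adjustment
print(s_admissible(diagram_y, {"S_Y"}, "X", "Y", {"Z"}))   # S into Y
```

In the first diagram, conditioning on Z separates Y from S_Z, so the z-specific effect carries over; without Z the check fails, and with S pointing into Y no covariate set can pass.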
Three causal paths and S-nodes:
1. S on the X->Y path (treatment mechanism differs) - transfer is impossible without additional data
2. S on confounder Z - correct via P*(Z)
3. S on a collider - typically harmless, does not block transfer
The topological position of S in the diagram determines transportability.
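For case 2, the elimination of S can be written out step by step. The derivation below is a sketch for one specific diagram, assumed here for illustration: Z -> X, Z -> Y, X -> Y, with a single S-node pointing into Z. Conditioning on s stands for "in the target population", and D_{\bar X} denotes the selection diagram with arrows into X removed.

```latex
\begin{align*}
P^*(y \mid do(x))
  &= P(y \mid do(x), s)
     && \text{definition of the target effect} \\
  &= \sum_z P(y \mid do(x), z, s)\, P(z \mid do(x), s)
     && \text{condition on } Z \\
  &= \sum_z P(y \mid do(x), z)\, P(z \mid do(x), s)
     && \text{Rule 1: } (Y \perp S \mid Z, X) \text{ in } D_{\bar X} \\
  &= \sum_z P(y \mid do(x), z)\, P(z \mid s)
     && \text{Rule 3: } Z \text{ is not a descendant of } X \\
  &= \sum_z P(y \mid do(x), z)\, P^*(z)
     && P(z \mid s) = P^*(z)
\end{align*}
```

The final expression has no S inside a do-term: the z-specific effects come from the source experiment and P*(z) comes from observational data in the target.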
A selection diagram has S_Z -> Z (age differs) and X -> Y (treatment mechanism is the same). Which formula gives P*(Y|do(X))?
Transport Formula
The T-formula (transport formula) generalizes the ID algorithm to two populations. When P*(y|do(x)) is identifiable in a selection diagram, the algorithm returns an explicit expression in terms of P(.), P*(.), and interventional distributions. For multiple sources, Bareinboim (2016) developed mSBD (meta-synthetic backdoor) - a generalization of backdoor criteria to fusion of multiple datasets.
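As a numeric illustration of the simplest such expression, sum_z P(y | do(x), z) * P*(z), the sketch below reweights z-specific effects from a source trial by the target covariate distribution. Every number is made up for illustration, and the function name `transport` is hypothetical; the example assumes the only difference between populations is the distribution of Z.

```python
def transport(p_y_do_x_z, p_z, x):
    """Return sum_z P(y=1 | do(x), z) * p_z[z] for a chosen covariate distribution."""
    return sum(p_y_do_x_z[(x, z)] * p_z[z] for z in p_z)


# Hypothetical z-specific mortality from a source RCT stratified by age group.
# Keys are (treatment, age group); values are P(y=1 | do(x), z).
p_y_do_x_z = {
    (1, "young"): 0.10, (1, "old"): 0.30,
    (0, "young"): 0.20, (0, "old"): 0.50,
}

p_z_source = {"young": 0.7, "old": 0.3}   # hypothetical P(z) in the source
p_z_target = {"young": 0.3, "old": 0.7}   # hypothetical P*(z) in the target

# Average treatment effect in the source population.
ate_source = transport(p_y_do_x_z, p_z_source, 1) - transport(p_y_do_x_z, p_z_source, 0)
# Transported effect: same z-specific effects, target age distribution.
ate_target = transport(p_y_do_x_z, p_z_target, 1) - transport(p_y_do_x_z, p_z_target, 0)

print(f"source ATE: {ate_source:.2f}")
print(f"target ATE: {ate_target:.2f}")
```

With these made-up numbers the source effect is about -0.13 and the transported effect about -0.17: the trial is unbiased for its own population, yet its headline number is not the right answer for the target.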
Meta-analysis via transportability: instead of one source, Pi = {P_1, ..., P_k} is a set of studies, each with partial data. The mSBD criterion: the problem is identifiable if there exists a partition of variables such that each piece can be estimated from some source. Applications: federated causal learning (different hospitals each seeing only their own patients) and clinical trial generalization.
Federated causal learning: when multiple hospitals cannot share raw patient data due to HIPAA/GDPR, the transportability framework allows identifying causal effects through summary statistics only - provided the selection diagram is known. Bareinboim & Pearl (2016) proved necessary and sufficient conditions for such data fusion.
Misconception: if RCT results are statistically significant, they apply to any population.
Statistical significance only addresses random error in the study sample; together with randomization it supports internal validity. External validity (transfer) is a separate question that requires analyzing the selection diagram.
An RCT can be completely precise for its own population and completely irrelevant for another. Bareinboim-Pearl provide the formal criterion: transportability is determined by the structure of the causal graph, not by the p-value.
Two sources are available: P_1 (observational, with Z) and P_2 (RCT, without Z). Target: P*(Z) is known. Why can't the ATE from P_2 (RCT) be used directly?
Key ideas
- **Selection diagram:** DAG + S-nodes (S_i -> V_i if P(V_i|pa(V_i)) != P*(V_i|pa(V_i))). Source: S=0, target: S=1. The position of S in the graph determines whether transfer is possible.
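- **Transport formula:** P*(y|do(x)) is identifiable if do-calculus removes all S-variables from the do-terms. Simplest case, when S points only into a pre-treatment covariate Z: P*(y|do(x)) = sum_z P(y|do(x),z) * P*(z). If Z also satisfies the backdoor criterion in the source, P(y|do(x),z) = P(y|x,z).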
- **mSBD for meta-analysis:** multiple sources Pi = {P_1,...,P_k} with partial data. The problem is identifiable if a fusion exists where each component can be estimated from some source.
Related topics
Transportability builds on do-calculus and connects to several research directions:
- Do-operator and interventions — Do-calculus is the primary tool for eliminating S-variables
- Counterfactuals — Transportability at Rung 3 requires structural equations
Questions for reflection
- An S-node is on X (treatment assignment mechanism differs). What does this mean for the RCT from the source? How does it change the transfer strategy?
- Two hospitals cannot share patient data. What are the minimum summary statistics they must exchange to identify P*(Y|do(X)) in a third hospital?
- In federated learning, models train locally and gradients are aggregated. Does this solve the transportability problem or only the privacy problem?