Causal Calculus
Bayesian Networks for Causality
Why does regressing Y on X fail to give the causal effect even when the model is statistically significant? What structure must be added to obtain causality from observational data?
- **Medical research:** Bayesian networks model how treatments affect biomarkers, separating direct and mediated effects for clinical trial analysis
- **AlphaFold2:** causal structures in evolutionary constraints - separating causal co-evolutions from spurious correlations across protein families
- **Advertising:** Facebook uses causal DAGs to separate the effect of ad exposure from organic demand in conversion lift studies
- **Fault diagnosis:** NASA applies causal DAGs to diagnose spacecraft system failures from telemetry without controlled experiments
Предварительные знания
- Probability theory: conditional independence
- Graph theory: DAG, paths, connectivity
- do-calculus: fundamentals
A causal Bayesian network is a DAG G = (V, E) together with conditional distributions P(Xi | Pa(Xi)) for each node. An edge X -> Y means X is a direct cause of Y. The joint distribution factorizes via the Markov condition. The critical insight: the same joint distribution can arise from multiple DAGs - only interventional distributions distinguish them.
Testability: all conditional independencies predicted by d-separation can be tested statistically. This allows causal assumptions to be checked against data rather than accepted on faith - a fundamental difference from classical regression modeling.
DAG Structure: nodes, edges, factorization
A Bayesian network encodes a joint distribution through a directed acyclic graph (DAG). Nodes are random variables and directed edges express direct causal or statistical dependence. Acyclicity ensures a topological order and a well-defined factorization of the joint density. Pearl and Heckerman showed in the 1990s that DAGs are a universal tool for compact representation of high-dimensional distributions in diagnostics, credit scoring, and genetics.
The topological order is not unique, but the factorization is invariant under any valid order. This allows inference to proceed in any admissible order and lets engines optimize the order for total operation count.
At the implementation level a Bayesian network is stored as an array of conditional distributions (CPTs for discrete or CPDs for continuous variables) and a parent list per node. Storage is O(n * max_parents * max_values^max_parents), which makes networks with tens of thousands of nodes practical under bounded density.
What does acyclicity of a Bayesian network guarantee?
Conditional Independence and d-Separation
The DAG structure imposes conditional independence constraints via path geometry. The d-separation criterion (Pearl 1988) gives an algorithmic rule: if every path between X and Y is blocked by a set Z, then X is independent of Y given Z. This criterion is complete with respect to the class of all distributions compatible with the graph.
Testing d-separation runs in O(V + E) for a pairwise query via breadth-first search with open/closed flags on each edge. This enables structure-discovery algorithms without enumerating distributions.
Conditioning on a collider or its descendant induces a spurious dependence (selection bias). This explains the Berkson paradox: hospital patients appear sicker on one dimension when controlling for another, even with no causal link.
In the graph A -> B <- C, what happens to the dependence between A and C when conditioning on B?
Probabilistic Inference Algorithms
Inference task: compute P(Q | E) for query variables Q given evidence E. Exact inference is NP-hard in general (Cooper 1990), but in graphs with small treewidth it is polynomial via variable elimination or junction trees. For large treewidth approximate methods take over - sampling and variational inference.
Junction tree (Lauritzen-Spiegelhalter 1988) is the canonical exact algorithm. The DAG is converted to a clique tree via moralization and triangulation. Message passing on this tree yields all marginals in a single pass. Complexity is exponential in the largest clique size.
What sets the computational cost of exact inference in a Bayesian network?
Connections to related areas
Causal Bayesian networks bridge graphical probability theory and causal inference.
- do-calculus — Related topic
- Structural Equation Models (SEM) — Related topic
- Causal Discovery — Related topic
- Counterfactual inference — Related topic
Итоги
- Causal DAG G = (V,E) encodes causal assumptions; joint P = prod P(Xi|Pa(Xi)) is the Markov factorization
- d-separation: Z d-separates X and Y if all paths are blocked (chains/forks with node in Z, colliders without nodes from Z)
- Collider X->Z<-Y: marginally independent X and Y become dependent upon conditioning on Z (Berkson's paradox)
- do(X=x) models intervention: removes incoming edges to X, severing backdoor confounding
- Backdoor criterion: P(Y|do(X)) = sum_z P(Y|X,Z)P(Z) when Z blocks all backdoor paths
- All d-separation-predicted independencies are statistically testable - this verifiability is the key advantage over untestable assumptions
In the DAG X -> Z <- Y (collider Z): what happens to the association between X and Y when conditioning on Z?
The collider creates selection bias: X and Y are marginally independent, but conditioning on Z opens the path. Classic example: if Z = 'admitted to university', X = 'intelligent', Y = 'diligent' - among admitted students, intelligence and diligence may appear negatively correlated, even though no such correlation exists in the general population.