Causal Calculus
Causal Representation Learning
Many learning algorithms fail under domain shift because they learn correlations, not causes; this is the motivation behind IRM (Arjovsky et al. 2019). Causal representation learning (CRL) reframes the problem: find latent variables $S$ that preserve causal structure. LiNGAM and NOTEARS recover this structure from data.
- Domain generalization: transferring a model across clinics with different treatment protocols
- Biomarker discovery: which molecular variables are causally linked to disease?
- Disentangled representations: independent latent factors of image generation
- Gene regulatory networks: NOTEARS for recovering transcriptional networks
- Distribution-shift robustness in autonomous agents
Lesson objectives
- Understand Hyvarinen's identifiability theorem for non-Gaussian sources
- Apply DirectLiNGAM to recover causal order from a linear non-Gaussian model
- Use the NOTEARS penalty $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ for continuous DAG optimization
Prerequisites
- Structural causal models and DAGs
- Independent component analysis (ICA)
- Matrix exponential and its properties
Identifiability theorem: non-Gaussian sources
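CRL: given $X = g(S, \varepsilon)$, find $h$ such that $h(X) \approx \Pi \Lambda S$, that is, recover $S$ up to a permutation $\Pi$ and a scaling $\Lambda$. Identifiability: for linear mixing, if the components of $S$ are independent and at most one of them is Gaussian, $h$ is determined up to exactly this permutation and scaling; Hyvärinen and Morioka extend identifiability to nonlinear mixing under additional assumptions such as nonstationary sources. Gaussian sources are not identifiable: an orthogonal mixing of independent unit-variance Gaussians yields independent unit-variance Gaussians again, so the rotation cannot be detected from the data.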
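The non-identifiability of Gaussian sources can be checked numerically. The sketch below is an illustration only, assuming NumPy is available: it rotates two independent sources by 45 degrees and compares the excess kurtosis before and after. The helper `excess_kurtosis`, the sample size, and the choice of uniform sources are assumptions made for this example.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis of a 1-D sample (0 for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
n = 100_000
theta = np.pi / 4  # 45-degree rotation, an orthogonal mixing
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

gauss = rng.standard_normal((n, 2))   # independent Gaussian sources
unif = rng.uniform(-1.0, 1.0, (n, 2)) # independent non-Gaussian sources

gauss_mixed = gauss @ R.T
unif_mixed = unif @ R.T

# Gaussian: the mixture is statistically indistinguishable from the sources.
k_gauss = excess_kurtosis(gauss[:, 0])
k_gauss_mixed = excess_kurtosis(gauss_mixed[:, 0])

# Uniform: mixing moves the marginal closer to Gaussian, so the rotation
# leaves a detectable trace in the higher-order statistics.
k_unif = excess_kurtosis(unif[:, 0])
k_unif_mixed = excess_kurtosis(unif_mixed[:, 0])

print(f"Gaussian: {k_gauss:+.2f} -> {k_gauss_mixed:+.2f}")
print(f"Uniform:  {k_unif:+.2f} -> {k_unif_mixed:+.2f}")
```

For uniform sources the excess kurtosis is about $-1.2$ before mixing and about $-0.6$ after, while for Gaussian sources it stays near $0$ in both cases. ICA exploits this kind of higher-order signal, which Gaussian data does not provide.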
LiNGAM: linear non-Gaussian SCM
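Shimizu et al. (2006): a linear acyclic SCM $X = BX + e$ with independent non-Gaussian noise terms $e_i$ is uniquely identifiable. The original LiNGAM algorithm recovers the causal order from an ICA decomposition. DirectLiNGAM (Shimizu et al. 2011) runs in $O(d^3)$ and avoids ICA: it greedily selects as root the variable that is most independent of the residuals left after regressing the other variables on it, removes its effect from the remaining variables, and repeats.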
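The greedy root selection can be sketched in a few lines. This is a simplified illustration, not the published DirectLiNGAM implementation: the function names `residual`, `dependence`, and `causal_order` are hypothetical, and the dependence score $|\mathrm{corr}(x^2, r^2)|$ is a crude stand-in, chosen here for brevity, for the independence measures the actual algorithm uses. The simulated chain and its coefficients are also assumptions.

```python
import numpy as np

def residual(target, regressor):
    """Residual of least-squares regression of target on regressor (both centered)."""
    coef = np.dot(target, regressor) / np.dot(regressor, regressor)
    return target - coef * regressor

def dependence(x, r):
    """Crude dependence score (assumption): |corr(x^2, r^2)|, near 0 if independent."""
    return abs(np.corrcoef(x**2, r**2)[0, 1])

def causal_order(X):
    """Greedy causal ordering in the spirit of DirectLiNGAM (simplified sketch)."""
    X = X - X.mean(axis=0)
    remaining = list(range(X.shape[1]))
    cols = {j: X[:, j].copy() for j in remaining}
    order = []
    while len(remaining) > 1:
        # Score each candidate root by how dependent it is on the residuals
        # of all other variables regressed on it.
        scores = {
            j: sum(dependence(cols[j], residual(cols[i], cols[j]))
                   for i in remaining if i != j)
            for j in remaining
        }
        root = min(scores, key=scores.get)
        order.append(root)
        remaining.remove(root)
        # Remove the root's linear effect from the remaining variables.
        for i in remaining:
            cols[i] = residual(cols[i], cols[root])
    order.append(remaining[0])
    return order

# Simulated chain a -> b -> c with uniform (non-Gaussian) noise.
rng = np.random.default_rng(0)
n = 5000
e = rng.uniform(-1.0, 1.0, (n, 3))
a = e[:, 0]
b = 0.8 * a + e[:, 1]
c = -0.6 * b + e[:, 2]
X = np.column_stack([c, a, b])  # columns shuffled: 0 = c, 1 = a, 2 = b

order = causal_order(X)
print(order)
```

With the columns shuffled as above, the true causal order is column 1, then column 2, then column 0. The same procedure applied to Gaussian noise would have no reliable signal to work with, because uncorrelated Gaussian residuals are automatically independent in both regression directions.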
NOTEARS: continuous DAG optimization
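Zheng et al. (2018): the acyclicity constraint $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$, which holds exactly when $B$ is the weighted adjacency matrix of a DAG, is continuous and differentiable. The problem $\min_B \|X - XB^T\|_F^2 + \lambda\|B\|_1$ subject to $h(B) = 0$ is solved with an augmented Lagrangian instead of a combinatorial search over DAGs, whose number grows super-exponentially in $d$ (they form a subset of the $2^{d(d-1)}$ directed graphs without self-loops).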
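Why $h$ characterizes acyclicity follows from the power series of the matrix exponential. The derivation below is a short sketch; the symbols $F$, $\rho$, and $\alpha$ in the augmented Lagrangian are notation introduced here for illustration.

$$
\mathrm{tr}\left(e^{B\circ B}\right) = \sum_{k=0}^{\infty} \frac{\mathrm{tr}\left((B\circ B)^k\right)}{k!} = d + \sum_{k=1}^{\infty} \frac{\mathrm{tr}\left((B\circ B)^k\right)}{k!}
$$

Since $B\circ B$ has nonnegative entries, $\mathrm{tr}\left((B\circ B)^k\right) \ge 0$ is the total weight of closed walks of length $k$. Hence $h(B) = 0$ if and only if the graph has no closed walks, that is, no cycles. The gradient has the closed form

$$
\nabla h(B) = \left(e^{B\circ B}\right)^T \circ 2B .
$$

Writing $F(B) = \|X - XB^T\|_F^2 + \lambda\|B\|_1$, one common form of the augmented Lagrangian is

$$
L_\rho(B, \alpha) = F(B) + \frac{\rho}{2}\, h(B)^2 + \alpha\, h(B), \qquad \alpha \leftarrow \alpha + \rho\, h(B),
$$

where each outer iteration minimizes $L_\rho$ over $B$, updates the multiplier $\alpha$, and increases $\rho$ if $h(B)$ has not decreased enough.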
Identifiability: Non-Gaussian Sources
Arjovsky, Bottou, and colleagues (2019) argued in the IRM paper that models fail under domain shift when they rely on statistical rather than causal correlations. Causal representation learning (CRL) formalizes the task: given $X = g(S, \varepsilon)$, find $h$ such that $h(X)$ recovers $S$ up to permutation and scaling (identifiability results for ICA, extended to nonlinear mixing by Hyvärinen and Morioka).
What does Hyvarinen's identifiability theorem guarantee for ICA?
LiNGAM and NOTEARS: DAG Discovery
Shimizu et al. (2006) proved that a linear non-Gaussian acyclic SCM is uniquely identifiable; DirectLiNGAM finds the causal order in $O(d^3)$ operations. NOTEARS (Zheng et al. 2018): the acyclicity constraint $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ is continuous and differentiable, enabling gradient-based optimization instead of a combinatorial search over up to $2^{d(d-1)}$ candidate directed graphs.
What is the key contribution of NOTEARS (Zheng 2018) to DAG discovery?
Checking acyclicity via $h(B)$
DAG $W = \begin{pmatrix} 0 & 0.7 \\ 0 & 0 \end{pmatrix}$: $W \circ W$ is nilpotent, so $h(W) = 0$ exactly. Cyclic graph $W = \begin{pmatrix} 0 & 0.7 \\ 0.4 & 0 \end{pmatrix}$: $h(W) = 2\cosh(0.7 \cdot 0.4) - 2 \approx 0.079 > 0$. The penalty is zero on DAGs and grows continuously with the weight of the cycles.
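The two values can be reproduced numerically. The sketch below assumes only NumPy and evaluates the matrix exponential with a truncated power series, which is adequate for small matrices with small entries; the function name `notears_h` and the truncation length are choices made for this example.

```python
import numpy as np

def notears_h(W, terms=30):
    """h(W) = tr(exp(W o W)) - d, using a truncated power series for exp."""
    A = W * W                  # elementwise (Hadamard) square
    d = A.shape[0]
    term = np.eye(d)           # current series term A^k / k!
    total = np.eye(d)          # running sum of the series
    for k in range(1, terms):
        term = term @ A / k
        total += term
    return np.trace(total) - d

W_dag = np.array([[0.0, 0.7],
                  [0.0, 0.0]])
W_cyclic = np.array([[0.0, 0.7],
                     [0.4, 0.0]])

h_dag = notears_h(W_dag)
h_cyclic = notears_h(W_cyclic)

print(f"h(DAG)    = {h_dag:.6f}")
print(f"h(cyclic) = {h_cyclic:.6f}")
```

For larger or badly scaled matrices, `scipy.linalg.expm` is a more robust way to compute the exponential than a truncated series.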
Summary
- CRL identifies latent variables up to permutation/scaling when sources are independent and non-Gaussian (Hyvarinen's theorem)
- LiNGAM recovers causal order via ICA; DirectLiNGAM runs in $O(d^3)$
- NOTEARS converts acyclicity $h(B)=0$ into a smooth constraint, enabling gradient-based DAG learning
Connections to other topics
Causal representation is the foundation of robust ML: IRM (Arjovsky 2019) exploits the invariance of causal features across domains. DAG discovery algorithms (NOTEARS, GES, PC) are applied in bioinformatics to recover gene regulatory networks from expression data.
Questions for reflection
- Why are Gaussian sources not identifiable in ICA? What fundamentally changes with non-Gaussian distributions?
- Does NOTEARS guarantee finding a DAG or only a local minimum? What are the practical consequences?
- IRM seeks features that are invariant across environments. How does this relate to causality: which features should be invariant and why?