Causal Calculus

Causal Representation Learning

Many learning algorithms fail under domain shift because they learn correlations rather than causes, which is the motivation behind invariant risk minimization (IRM; Arjovsky et al., 2019). Causal representation learning (CRL) reframes the problem: find latent variables $S$ that preserve causal structure. Methods such as LiNGAM and NOTEARS recover causal graph structure from observed data.

Domain generalization: transferring a model across clinics with different treatment protocols
Biomarker discovery: which molecular variables are causally linked to disease?
Disentangled representations: independent latent factors of image generation
Gene regulatory networks: NOTEARS for recovering transcriptional networks
Distribution-shift robustness in autonomous agents

Lesson objectives

Understand Hyvarinen's identifiability theorem for non-Gaussian sources
Apply DirectLiNGAM to recover causal order from a linear non-Gaussian model
Use the NOTEARS penalty $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ for continuous DAG optimization

Prerequisites

Structural causal models and DAGs
Independent component analysis (ICA)
Matrix exponential and its properties

Identifiability theorem: non-Gaussian sources

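CRL: given observations $X = g(S, \varepsilon)$, find an encoder $h$ such that $h(X) \approx \Pi \Lambda S$, that is, recover $S$ up to a permutation $\Pi$ and a scaling $\Lambda$. For linear mixing, the classical ICA identifiability result (Comon, 1994) applies: if the components of $S$ are independent and at most one of them is Gaussian, $h$ is determined up to permutation and scaling. For nonlinear $g$, independence and non-Gaussianity alone are not enough; the Hyvarinen-Morioka results add further assumptions such as temporal structure or nonstationarity. Gaussian sources are not identifiable: any orthogonal mixing of independent unit-variance Gaussians yields independent Gaussians again, so the mixing leaves no trace in the data.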
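The role of non-Gaussianity can be checked numerically. The sketch below is only an illustration under assumed settings: two unit-variance sources, a 45-degree rotation as the orthogonal mixing, and the correlation between squared components as a crude dependence measure (the helper `square_corr` is a hypothetical name, not part of any library).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# 45-degree rotation: an orthogonal mixing matrix (assumed for illustration).
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def square_corr(Z):
    """Correlation between the squared components: a crude dependence check."""
    return np.corrcoef(Z[0] ** 2, Z[1] ** 2)[0, 1]

# Independent unit-variance sources, shape (2, n).
S_gauss = rng.standard_normal((2, n))
S_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))

gauss_before, gauss_after = square_corr(S_gauss), square_corr(R @ S_gauss)
unif_before, unif_after = square_corr(S_unif), square_corr(R @ S_unif)

print(f"Gaussian: before {gauss_before:+.3f}, after {gauss_after:+.3f}")
print(f"Uniform:  before {unif_before:+.3f}, after {unif_after:+.3f}")
```

For Gaussian sources both values stay near zero, so the rotation cannot be detected. For uniform sources the rotated components remain uncorrelated but become clearly dependent, which is the kind of signal ICA exploits.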

LiNGAM: linear non-Gaussian SCM

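Shimizu et al. (2006): a linear SCM $X = BX + e$ with acyclic $B$ and independent non-Gaussian noise terms $e_i$ is uniquely identifiable. The original LiNGAM algorithm recovers the causal order via an ICA decomposition. DirectLiNGAM avoids ICA and selects root variables greedily: at each step it picks the variable that is most independent of the residuals obtained by regressing the remaining variables on it, which takes $O(d^3)$ pairwise regressions in total.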
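The identifiability claim can be illustrated on two variables. The sketch below is not DirectLiNGAM itself: it swaps in a simple fourth-moment asymmetry statistic as a stand-in for the residual-independence test, and the function name `direction_score`, the coefficient 0.8, and the uniform noise are assumptions chosen for the demonstration.

```python
import numpy as np

def direction_score(x, y):
    """Fourth-moment asymmetry; positive values favour x -> y.

    For a linear model y = b*x + e with e independent of x, the expectation
    of this statistic is rho^2 * (1 - rho^2) * |excess kurtosis of x|,
    so it vanishes when x is Gaussian.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    rho = np.mean(x * y)
    excess_kurtosis = np.mean(x ** 4) - 3.0
    asymmetry = np.mean(x ** 3 * y - x * y ** 3)
    return np.sign(excess_kurtosis) * rho * asymmetry

rng = np.random.default_rng(0)
n = 50_000

# Non-Gaussian case: x causes y, both disturbances uniform.
x_u = rng.uniform(-1, 1, n)
y_u = 0.8 * x_u + rng.uniform(-1, 1, n)

# Gaussian control with the same linear structure.
x_g = rng.standard_normal(n)
y_g = 0.8 * x_g + rng.standard_normal(n)

forward = direction_score(x_u, y_u)    # tests x -> y
backward = direction_score(y_u, x_u)   # tests y -> x
gaussian = direction_score(x_g, y_g)   # no usable asymmetry expected

print(f"uniform  x->y: {forward:+.3f}, y->x: {backward:+.3f}")
print(f"gaussian x->y: {gaussian:+.3f}")
```

With uniform noise the score is clearly positive in the true direction and negative in the reverse one, while in the Gaussian control it stays close to zero: the two directions fit the data equally well.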

NOTEARS: continuous DAG optimization

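Zheng et al. (2018): the acyclicity constraint $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ is smooth (continuous and differentiable in $B$) and holds exactly when $B$ is the weighted adjacency matrix of a DAG. The problem $\min_B \|X - XB^T\|_F^2 + \lambda\|B\|_1$ subject to $h(B) = 0$ is solved with an augmented Lagrangian instead of a combinatorial search over the DAGs, which form a subset of the $2^{d(d-1)}$ directed graphs on $d$ nodes.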
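Why $h$ detects cycles, and how the constrained problem is typically handled, can be written out as follows. This is a sketch of the standard argument; the multiplier $\alpha$ and the penalty weight $\rho$ are notation introduced here for illustration.

$$\mathrm{tr}(e^{B\circ B}) = d + \sum_{k\ge 1} \frac{1}{k!}\,\mathrm{tr}\big((B\circ B)^k\big)$$

Each term $\mathrm{tr}\big((B\circ B)^k\big)$ sums products of squared weights along closed walks of length $k$. All entries of $B\circ B$ are nonnegative, so the sum is zero exactly when the graph has no closed walks, that is, when it is acyclic. The gradient is available in closed form:

$$\nabla h(B) = \big(e^{B\circ B}\big)^{\top} \circ 2B$$

The augmented Lagrangian adds a linear and a quadratic penalty on $h$, and the multiplier is updated after each inner minimization over $B$:

$$L_\rho(B, \alpha) = \|X - XB^T\|_F^2 + \lambda\|B\|_1 + \alpha\,h(B) + \frac{\rho}{2}\,h(B)^2, \qquad \alpha \leftarrow \alpha + \rho\,h(B)$$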

Identifiability: Non-Gaussian Sources

Arjovsky et al. (2019) argued in the IRM paper that models fail under domain shift when they rely on statistical rather than causal relationships. Causal representation learning (CRL) formalizes the task: given $X = g(S, \varepsilon)$, find $h$ such that $h(X)$ recovers $S$ up to permutation and scaling (the setting of the Hyvarinen-Morioka identifiability results).

What does Hyvarinen's identifiability theorem guarantee for ICA?

LiNGAM and NOTEARS: DAG Discovery

Shimizu et al. (2006) proved that a linear non-Gaussian SCM is uniquely identifiable; DirectLiNGAM finds the causal order with $O(d^3)$ pairwise regressions. Zheng et al. (2018) introduced NOTEARS: the acyclicity constraint $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ is smooth, enabling gradient-based optimization instead of a combinatorial search over the DAGs among the $2^{d(d-1)}$ directed graphs on $d$ nodes.

What is the key contribution of NOTEARS (Zheng 2018) to DAG discovery?

Checking acyclicity via $h(B)$

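Here $W$ plays the role of $B$. For the DAG $W = \begin{pmatrix} 0 & 0.7 \\ 0 & 0 \end{pmatrix}$, the matrix $W \circ W$ is nilpotent, so $h(W) = 0$ exactly. For the cyclic graph $W = \begin{pmatrix} 0 & 0.7 \\ 0.4 & 0 \end{pmatrix}$, $W \circ W$ has eigenvalues $\pm 0.28$, so $h(W) = 2\cosh(0.28) - 2 \approx 0.079 > 0$. The penalty is zero on DAGs and grows smoothly with the weight of the cycles.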
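The two cases can be reproduced in a few lines. This is a minimal sketch: the truncated Taylor series `expm_series` is a hypothetical helper standing in for a library matrix exponential such as `scipy.linalg.expm`, and it is only adequate for small matrices with small entries like these.

```python
import numpy as np

def expm_series(A, terms=30):
    """Truncated Taylor series for the matrix exponential (small matrices only)."""
    total = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k          # A^k / k!
        total = total + term
    return total

def h(W):
    """NOTEARS acyclicity function: tr(exp(W * W)) - d, with * elementwise."""
    d = W.shape[0]
    return np.trace(expm_series(W * W)) - d

def grad_h(W):
    """Gradient of h: exp(W * W) transposed, times 2W elementwise."""
    return expm_series(W * W).T * 2 * W

W_dag = np.array([[0.0, 0.7], [0.0, 0.0]])
W_cyc = np.array([[0.0, 0.7], [0.4, 0.0]])

h_dag, h_cyc = h(W_dag), h(W_cyc)
print(h_dag)            # 0.0: W_dag * W_dag is nilpotent, so higher powers vanish
print(round(h_cyc, 4))  # 0.0789, i.e. 2*cosh(0.28) - 2
print(grad_h(W_cyc))    # nonzero on both edges of the cycle
```

The gradient is what lets a gradient-based optimizer shrink the edges that form a cycle; at the DAG it is zero because $h$ is already at its minimum.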

Summary

CRL identifies latent variables up to permutation/scaling when sources are independent and non-Gaussian (Hyvarinen's theorem)
LiNGAM recovers causal order via ICA; DirectLiNGAM runs in $O(d^3)$
NOTEARS converts acyclicity $h(B)=0$ into a smooth constraint, enabling gradient-based DAG learning

Connections to other topics

Causal representation is the foundation of robust ML: IRM (Arjovsky 2019) exploits the invariance of causal features across domains. DAG discovery algorithms (NOTEARS, GES, PC) are applied in bioinformatics to recover gene regulatory networks from expression data.


Questions for reflection

Why are Gaussian sources not identifiable in ICA? What fundamentally changes with non-Gaussian distributions?
Does NOTEARS guarantee finding a DAG or only a local minimum? What are the practical consequences?
IRM seeks features that are invariant across environments. How does this relate to causality: which features should be invariant and why?