Causal Calculus

Causal Representation Learning

Many learning algorithms fail under domain shift because they learn correlations, not causes; this observation motivates invariant risk minimization (IRM; Arjovsky et al. 2019). Causal representation learning (CRL) reframes the problem: find latent variables $S$ that preserve causal structure. LiNGAM and NOTEARS recover this structure from data.

  • Domain generalization: transferring a model across clinics with different treatment protocols
  • Biomarker discovery: which molecular variables are causally linked to disease?
  • Disentangled representations: independent latent factors of image generation
  • Gene regulatory networks: NOTEARS for recovering transcriptional networks
  • Distribution-shift robustness in autonomous agents

Lesson objectives

  • Understand Hyvarinen's identifiability theorem for non-Gaussian sources
  • Apply DirectLiNGAM to recover causal order from a linear non-Gaussian model
  • Use the NOTEARS penalty $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ for continuous DAG optimization

Prerequisites

  • Structural causal models and DAGs
  • Independent component analysis (ICA)
  • Matrix exponential and its properties

Identifiability theorem: non-Gaussian sources

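CRL: given $X = g(S, \varepsilon)$, find $h$ such that $h(X) \approx \Pi \Lambda S$, i.e. recover $S$ up to a permutation $\Pi$ and a scaling $\Lambda$. Hyvarinen-Morioka theorem: if the components of $S$ are independent and non-Gaussian, $h$ is determined up to this permutation and scaling (for nonlinear $g$ the result needs additional assumptions, such as temporal structure or auxiliary variables). Gaussian sources are not identifiable: any orthogonal mixing of independent unit-variance Gaussians yields independent Gaussians with the same joint distribution, so the mixing leaves no trace in the data.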
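The contrast between Gaussian and non-Gaussian sources can be checked numerically. The sketch below is an illustration only: the helper `excess_kurtosis`, the 45-degree rotation `Q`, and the choice of uniform sources are assumptions made for this example, not part of the theorem. It mixes two independent sources with an orthogonal matrix and compares a fourth-order statistic before and after mixing.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; the population value is 0 for a Gaussian."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(0)
n = 200_000

# A 45-degree rotation: one example of an orthogonal mixing matrix.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Independent unit-variance sources, shape (2, n).
gauss = rng.standard_normal((2, n))
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))

# Kurtosis of the first component before and after mixing.
k_gauss_before = excess_kurtosis(gauss[0])
k_gauss_after = excess_kurtosis((Q @ gauss)[0])
k_unif_before = excess_kurtosis(unif[0])
k_unif_after = excess_kurtosis((Q @ unif)[0])

print(k_gauss_before, k_gauss_after)  # both near the Gaussian value 0
print(k_unif_before, k_unif_after)    # the mixed signal moves toward 0
```

In theory the uniform source has excess kurtosis $-1.2$ and the rotated mixture $-0.6$, while the Gaussian value is $0$ in both cases. The rotation is therefore detectable from the data for uniform sources and invisible for Gaussian ones, which is the intuition behind the non-Gaussianity requirement.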

LiNGAM: linear non-Gaussian SCM

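Shimizu et al. (2006): a linear SCM $X = BX + e$ with acyclic $B$ and independent non-Gaussian noise terms $e_i$ is uniquely identifiable. The original LiNGAM algorithm recovers the causal order from an ICA decomposition. DirectLiNGAM avoids ICA and runs in $O(d^3)$: it greedily selects a root variable, the one most independent of the residuals obtained by regressing the other variables on it, removes its effect, and repeats on the residuals.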
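The root-selection step can be sketched for a two-variable model. This is a simplified illustration, not the published algorithm: DirectLiNGAM uses a proper independence measure, whereas the `dependence` function below is a crude stand-in chosen for this example (correlations between cubed and raw values, which vanish under independence). The helper names `residual`, `dependence`, and `pick_root` are likewise hypothetical.

```python
import numpy as np

def residual(target, regressor):
    """Residual of the least-squares regression of target on regressor."""
    t = target - target.mean()
    r = regressor - regressor.mean()
    slope = (r @ t) / (r @ r)
    return t - slope * r

def dependence(x, r):
    """Crude dependence proxy (an assumption of this sketch): nonlinear
    correlations that are zero in expectation when x and r are independent."""
    x = (x - x.mean()) / x.std()
    r = (r - r.mean()) / r.std()
    return abs(np.corrcoef(x**3, r)[0, 1]) + abs(np.corrcoef(x, r**3)[0, 1])

def pick_root(X):
    """Return the column that is most independent of the residuals
    obtained by regressing every other column on it, plus all scores."""
    d = X.shape[1]
    scores = [
        sum(dependence(X[:, j], residual(X[:, i], X[:, j]))
            for i in range(d) if i != j)
        for j in range(d)
    ]
    return int(np.argmin(scores)), scores

# Ground truth: x1 -> x2 with uniform (non-Gaussian) noise.
rng = np.random.default_rng(0)
n = 20_000
e1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)
e2 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)
x1 = e1
x2 = 2.0 * x1 + e2
X = np.column_stack([x1, x2])

root, scores = pick_root(X)
print(root, scores)
```

Regressing $x_2$ on $x_1$ leaves a residual close to $e_2$, which is independent of $x_1$, so column 0 should get the lower score. In the reverse direction the residual is uncorrelated with the regressor but still dependent on it, which only higher-order statistics reveal.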

NOTEARS: continuous DAG optimization

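Zheng et al. (2018): the acyclicity constraint $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ is continuous and differentiable, and it holds exactly when $B$ is the weighted adjacency matrix of a DAG. The problem $\min_B \|X - XB^T\|_F^2 + \lambda\|B\|_1$ subject to $h(B) = 0$ is solved with an augmented Lagrangian instead of a combinatorial search over DAGs, whose number grows super-exponentially in $d$ (they are a subset of the $2^{d(d-1)}$ directed graphs on $d$ nodes).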
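The augmented Lagrangian scheme can be written out in the notation of this section. This is a sketch: the penalty weight $\rho > 0$, the multiplier $\alpha$, and the iteration index $k$ are symbols introduced here for illustration. With $F(B) = \|X - XB^T\|_F^2 + \lambda\|B\|_1$:

$$
L_\rho(B, \alpha) = F(B) + \alpha\, h(B) + \frac{\rho}{2}\, h(B)^2, \qquad
B_{k+1} = \arg\min_B L_\rho(B, \alpha_k), \qquad
\alpha_{k+1} = \alpha_k + \rho\, h(B_{k+1}).
$$

Each inner minimization is unconstrained, so a gradient-based solver applies, using

$$
\nabla h(B) = \left(e^{B\circ B}\right)^{T} \circ 2B.
$$

In practice $\rho$ is typically increased when $h$ does not shrink fast enough between iterations, and small entries of the final $B$ are thresholded to zero.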

Identifiability: Non-Gaussian Sources

Arjovsky et al. (2019) argued in the IRM paper that models fail under domain shift when they rely on statistical rather than causal relationships. Causal representation learning (CRL) formalizes the task: given $X = g(S, \varepsilon)$, find $h$ such that $h(X)$ recovers $S$ up to permutation and scaling (Hyvarinen-Morioka theorem).

What does Hyvarinen's identifiability theorem guarantee for ICA?

LiNGAM and NOTEARS: DAG Discovery

Shimizu et al. (2006) proved that a linear non-Gaussian SCM is uniquely identifiable; DirectLiNGAM finds the causal order in $O(d^3)$ operations. NOTEARS (Zheng et al. 2018): the acyclicity constraint $h(B) = \mathrm{tr}(e^{B\circ B}) - d = 0$ is continuous, enabling gradient-based optimization instead of a combinatorial search over the DAGs among the $2^{d(d-1)}$ directed graphs on $d$ nodes.

What is the key contribution of NOTEARS (Zheng 2018) to DAG discovery?

Checking acyclicity via $h(B)$

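DAG $B = \begin{pmatrix} 0 & 0.7 \\ 0 & 0 \end{pmatrix}$: $B\circ B$ is nilpotent, so $h(B) = 0$ exactly. Cyclic graph $B = \begin{pmatrix} 0 & 0.7 \\ 0.4 & 0 \end{pmatrix}$: $B\circ B$ has eigenvalues $\pm 0.28$, so $h(B) = 2\cosh(0.28) - 2 \approx 0.079 > 0$. The penalty is zero exactly on DAGs and grows smoothly with the weight of the cycles.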
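The two $2\times 2$ examples above can be verified numerically. The sketch below evaluates $\mathrm{tr}(e^{B\circ B}) - d$ through the truncated power series $\sum_{k\ge 1} \mathrm{tr}((B\circ B)^k)/k!$ so that it depends only on NumPy; the function name `h` and the truncation length `terms` are choices made for this example, and `scipy.linalg.expm` is the usual alternative for computing the matrix exponential.

```python
import math
import numpy as np

def h(B, terms=30):
    """Acyclicity value tr(exp(B*B)) - d, with B*B the Hadamard square,
    computed from a truncated power series of the matrix exponential."""
    M = B * B                        # entrywise square, nonnegative
    power = np.eye(B.shape[0])
    total = 0.0
    for k in range(1, terms + 1):
        power = power @ M / k        # now equals M^k / k!
        total += np.trace(power)
    return float(total)

B_dag = np.array([[0.0, 0.7],
                  [0.0, 0.0]])
B_cyc = np.array([[0.0, 0.7],
                  [0.4, 0.0]])

h_dag = h(B_dag)
h_cyc = h(B_cyc)

# Closed form for the cyclic case: B*B has eigenvalues +0.28 and -0.28.
closed_form = 2 * math.cosh(0.28) - 2

print(h_dag, h_cyc, closed_form)
```

For the DAG every power of $B\circ B$ has zero trace, so the series vanishes term by term; for the cyclic matrix the even powers contribute, and the result matches $2\cosh(0.28) - 2$.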

Summary

  • CRL identifies latent variables up to permutation/scaling when sources are independent and non-Gaussian (Hyvarinen's theorem)
  • LiNGAM recovers causal order via ICA; DirectLiNGAM runs in $O(d^3)$
  • NOTEARS converts acyclicity $h(B)=0$ into a smooth constraint, enabling gradient-based DAG learning

Connections to other topics

Causal representation is the foundation of robust ML: IRM (Arjovsky 2019) exploits the invariance of causal features across domains. DAG discovery algorithms (NOTEARS, GES, PC) are applied in bioinformatics to recover gene regulatory networks from expression data.


Questions for reflection

  • Why are Gaussian sources not identifiable in ICA? What fundamentally changes with non-Gaussian distributions?
  • Does NOTEARS guarantee finding a DAG or only a local minimum? What are the practical consequences?
  • IRM seeks features that are invariant across environments. How does this relate to causality: which features should be invariant and why?