Statistics

Probabilistic Graphical Models

Language, images, molecules, social networks - everything is a graph. Probabilistic graphical models are a unified mathematical language for reasoning about dependencies: from image pixels to brain neurons.

Error-correcting codes (LDPC): loopy BP approaches the Shannon limit in 5G mobile networks
Genomics: Gaussian GGMs build gene expression networks - who regulates whom
Computer vision: CRFs (conditional random fields) refine semantic segmentation boundaries

Предварительные знания

Causal Inference

Markov Random Fields: Factorization

Ernst Ising's 1925 model of 10^23 spins, each coupled only to its neighbours, is the same machinery DeepLab now uses for image segmentation and 5G LDPC codes use for error correction. A Markov random field factorizes the joint distribution over graph cliques.

**Connection to Deep Learning:** Conditional Random Fields (CRF) are used in NLP (named entity recognition) and Computer Vision (semantic segmentation). DeepLab adds a Dense CRF on top of a CNN to refine object boundaries. Restricted Boltzmann Machines (RBMs) are bipartite MRFs and the building blocks of Deep Belief Networks. Diffusion models are connected to energy-based models through score matching.

In an MRF, the partition function Z = ∑_X ∏_C ψ_C(X_C). Why is computing Z NP-hard for a general graph?

Directed vs Undirected Graphs

In a **Bayesian Network (DAG):** P(X) = ∏ P(Xᵢ | Pa(Xᵢ)). Independence is read via **d-separation**: Xᵢ ⊥ Xⱼ | Z if every path between them is blocked by Z. Three patterns: chain (A→B→C: A⊥C|B), fork (A←B→C: A⊥C|B), collider (A→B←C: A⊥C but A∦C|B!). In an **MRF**, independence follows from the global Markov property: X_A ⊥ X_B | X_S if S separates A from B in the graph. The **Markov blanket** in a DAG: parents + children + co-parents.

**I-map and Perfect map:** a graph G is an I-map of P if all independencies encoded in G hold in P (the graph may miss some of P's independencies). G is a perfect map if they match exactly. Most real distributions do not have a perfect map - a choice between DAG and MRF must be made. Structure-learning algorithms (PC, FCI, GES) search for the G that best approximates the I-map of the data.

In a DAG: Disease -> Symptom <- Other Disease. A patient presents with the symptom. Test results confirm the first disease. What happens to the probability of the second disease?

Loopy BP and Gaussian Graphical Models

**Belief Propagation (BP)** performs exact inference on trees by message passing: μ_{i→j}(xⱼ) = ∑_{xᵢ} ψᵢ(xᵢ) ψᵢⱼ(xᵢ,xⱼ) ∏_{k∈N(i)\j} μ_{k→i}(xᵢ). On trees this is exact and runs in O(n·|X|). **Loopy BP** applies the same equations to general graphs. It is approximate (convergence not guaranteed) but works remarkably well in practice (LDPC error-correcting codes, turbo codes). **Gaussian Graphical Model:** X ~ N(0,Σ), precision matrix Ω = Σ⁻¹ - a zero entry Ωᵢⱼ = 0 means Xᵢ ⊥ Xⱼ | all others.

**When loopy BP works and when it does not:** it performs well on sparse graphs with long cycles (LDPC codes approach the Shannon limit), and when interactions are weak. It fails on short cycles (triangles) and strong dependencies. Alternatives: expectation propagation (EP), variational message passing, tree-reweighted BP (TRW-BP). In Deep Learning, loopy BP is implemented as message-passing neural networks (MPNNs).

Graphical Lasso found Ω₁₃ = 0 (zero entry in the precision matrix). What does this mean?

Key ideas

MRF: P(X) = (1/Z) ∏_C ψ_C(X_C); computing Z is NP-hard for general graphs
DAG: d-separation; collider A->B<-C creates dependence A∦C|B (explaining away)
MRF: X_A ⊥ X_B | X_S when S separates them; Markov blanket = minimal feature set
Loopy BP: exact on trees, approximate on general graphs
GGM: Ωᵢⱼ = 0 <-> Xᵢ ⊥ Xⱼ | others; Graphical Lasso gives sparse Ω

Graphical models and the course

Graphical models unite Bayesian statistics, information theory, and algorithms. HMMs are special cases of directed chains. CRFs are conditioned MRFs. Variational CAVI is the variational analogue of belief propagation.

Variational Bayesian Inference — CAVI is the variational analogue of BP; mean-field is a fully disconnected graph
Causal Inference — A DAG is a causal graphical model; d-separation determines identifiability of causal effects

Вопросы для размышления

Explain why conditioning on a collider (B in A->B<-C) creates dependence between A and C. Give a real-world example where this leads to incorrect conclusions in data analysis.
Graphical Lasso minimizes −log P(X|Ω) + λ‖Ω‖₁. How does the choice of λ affect graph sparsity? How does this connect to the trade-off between interpretability and fit quality?
Loopy BP is used in turbo codes for error correction. Why does it work well there but is unreliable on dense graphs with strong interactions? What makes the cycle structure of LDPC codes special?

Связанные уроки

ml-15-naive-bayes