Statistical Physics and Probability
The Metropolis-Hastings MCMC algorithm was invented in 1953 to simulate nuclear weapons. Today it underlies Bayesian inference in statistics, diffusion models in generative AI, and simulated annealing optimization. All this computational power follows from one equation: the Boltzmann distribution.
- Simulated annealing: minimizing a loss function via random walks with temperature parameter $T$ - a direct analogy to cooling physical systems. As $T \to 0$, the system settles into the global minimum with probability $\to 1$.
- MCMC in Bayesian ML: sampling from the posterior $P(\theta | D) \propto P(D|\theta)P(\theta)$ via the MH algorithm or NUTS. Stan and PyMC implement HMC (Hamiltonian Monte Carlo) - a method inspired by molecular dynamics.
- Diffusion models (DDPM): the forward process adds Gaussian noise (analogous to thermal fluctuations); the reverse process restores the signal (analogous to cooling). The score function - gradient of the log-density - is analogous to a physical force.
Цели урока
- Derive the Boltzmann distribution from the maximum entropy principle
- Analyze the Ising model: phase transitions and critical behavior
- Justify the correctness of the Metropolis algorithm through detailed balance
Предварительные знания
- Information-theoretic methods: entropy and KL divergence
- Markov chains and their stationary distributions
- Statistical mechanics basics (physical concepts: energy, temperature)
Boltzmann distribution and maximum entropy
The Boltzmann distribution $P(s) = e^{-E(s)/T}/Z$, where $Z = \sum_s e^{-E(s)/T}$ is the partition function, $T$ is temperature. Derivation via maximum entropy $H(P)$ subject to fixed mean energy $\mathbb{E}[E] = U$: with Lagrange multiplier $\beta = 1/T$ one gets $P(s) \propto e^{-\beta E(s)}$. As $T \to 0$: $P$ concentrates on the energy minimum. As $T \to \infty$: uniform distribution - maximum entropy.
Metropolis-Hastings algorithm: propose $s' \sim q(s'|s)$; accept with probability $\alpha = \min(1, P(s')q(s|s')/(P(s)q(s'|s)))$. The detailed balance condition $P(s)\alpha(s\to s') q(s'|s) = P(s')\alpha(s'\to s) q(s|s')$ guarantees that $P$ is the stationary distribution. The Markov chain converges to $P$ under ergodicity (irreducibility + aperiodicity).
MCMC has a mixing problem: the chain can get stuck in one mode of a multimodal distribution. For the Ising model below $T_c$, the chain takes exponentially long to switch between $M > 0$ and $M < 0$ states. Solutions: parallel tempering (chains at different $T$ exchange states), HMC (uses gradient for efficient exploration), NUTS (automatic step size tuning).
Nicholas Metropolis and colleagues published the algorithm in 1953 for calculati
Nicholas Metropolis and colleagues published the algorithm in 1953 for calculations at Los Alamos. Wilfred Hastings generalized it in 1970 to asymmetric proposals. Wilhelm Lenz proposed the Ising model in 1920; his student Ernst Ising solved the 1D case in his 1924 thesis, incorrectly concluding there was no phase transition. Lars Onsager solved the 2D case in 1944, proving the existence of a transition.
Boltzmann-Gibbs distribution
In 1872 Ludwig Boltzmann captured 10^23 gas molecules with one formula: state probability scales like exp(-E/kT). The same exponential drives energy-based models, the softmax layer in neural networks, and MCMC samplers in modern probabilistic programming.
As beta goes to infinity, the Boltzmann distribution tends to...
The Ising model and phase transition
Two-dimensional Ising model: spins sigma_i in {-1,+1} on a lattice, energy E = -J sum sigma_i sigma_j over neighbours. Above a critical betaJ (about 0.4407) the spontaneous magnetisation jumps to a non-zero value. This is a canonical second-order phase transition with a diverging correlation length.
In the 2D Ising model at beta > betac, the magnetisation m...
Metropolis-Hastings as MCMC
The Metropolis algorithm (1953) samples from a distribution P via a random walk: propose x', accept with probability min(1, P(x')/P(x)). Only the ratio P(x')/P(x) is needed; Z cancels. The chain converges to an arbitrary P through local moves, without computing the partition function.
Why does Metropolis-Hastings work without knowing the normaliser Z?
Ising model and phase transition
Ising model on $\mathbb{Z}^2$: spins $\sigma_i \in \{\pm 1\}$, $E = -J\sum_{\langle i,j \rangle} \sigma_i \sigma_j - h\sum_i \sigma_i$. For $T > T_c$ (Ising critical temperature $T_c \approx 2.269 J/k_B$ for the square lattice): magnetization $M = \mathbb{E}[\sigma_i] = 0$ (paramagnet). For $T < T_c$: $M \neq 0$ (ferromagnet). Second-order phase transition. In ML: Restricted Boltzmann Machines (RBMs) are a bipartite version of the Ising model.
Итоги
- The Boltzmann distribution $P \propto e^{-E/T}$ maximizes entropy subject to a constrained mean energy.
- The Ising model exhibits a phase transition at the critical temperature.
- The MH algorithm constructs a Markov chain with the desired stationary distribution via detailed balance - the foundation of MCMC in Bayesian ML.
Connections to other topics
Diffusion models (Score SDE): the score function $\nabla \log p(x)$ is analogous to a physical force; the reverse process restores data like cooling. Information theory (prob-25): free energy $F = -T\log Z$ is a KL divergence to the scattered measure. Bayesian networks (prob-27): variational inference minimizes KL, analogous to minimizing free energy in physics.
- Prob 25 — related
- Prob 27 — related
Вопросы для размышления
- In simulated annealing, the temperature $T(t)$ decreases on a schedule. Cooling too fast gets stuck in local minima; too slow is inefficient. What mathematical criterion (related to the mixing time of the Markov chain) guarantees convergence to the global minimum?