Probability Theory
Markov Chains
Google's PageRank is the stationary distribution of a Markov chain on the web graph. The algorithm ranks 130 trillion pages by solving a single matrix equation. MCMC methods used in Bayesian ML to sample from complex posteriors are also Markov chains at their core.
- PageRank (Google): stationary distribution on the web graph ranks 130T pages
- MCMC in Bayesian ML: Metropolis-Hastings and HMC sample from posterior distributions
- Hidden Markov Models: speech recognition in Siri, Google Assistant, and Alexa
- Large language models: next-token prediction is a high-order Markov model
- Sequential A/B testing: Markov-based stopping rules for early experiment termination
- Finance: Vasicek and Hull-White interest rate models as Markov diffusions
Markov Property and Transition Matrix
PageRank - the algorithm Google uses to rank billions of pages - is the stationary distribution of a Markov chain. Every search query in the world implicitly solves the equation **pi = piP**. Andrey Markov invented this structure in 1906 while analyzing the alternation of vowels and consonants in Pushkin's Eugene Onegin. Today it is embedded in every recommendation system, language model, and Reinforcement Learning engine.
The Markov Property
A Markov chain is a sequence of random variables X₀, X₁, X₂, ... with the property that the future depends only on the present, not on the past.
Example: weather. States - sunny (S) and rainy (R). Tomorrow's weather depends only on today's, not on the entire week. Transition matrix:
| From \ To | S (tomorrow) | R (tomorrow) |
|---|---|---|
| S (today) | 0.8 | 0.2 |
| R (today) | 0.4 | 0.6 |
In matrix form, P is a row-stochastic matrix: each element Pᵢⱼ >= 0 and each row sums to 1.
n-step Transitions: Chapman-Kolmogorov
The probability of going from state i to state j in n steps is the (i,j) element of the matrix P^n. The Chapman-Kolmogorov equation:
For the weather example: P^2 gives two-day transition probabilities. As n -> inf, the matrix P^n converges - each row becomes the same vector. That vector is the stationary distribution.
P^n converges regardless of the starting row - at n=50 all rows of the matrix are identical. The initial state is forgotten.
Transition matrix P = [[0.7, 0.3], [0.5, 0.5]]. What is the probability of being in state 1 after 2 steps, starting from state 0?
Stationary Distribution
The stationary distribution pi is a probability vector that the chain does not move. Starting from distribution pi, one step later the distribution is still pi. It is the **left eigenvector** of P for eigenvalue 1.
For the weather example: solve pi*P = pi.
pi = [pi_S, pi_R] pi_S = 0.8*pi_S + 0.4*pi_R pi_R = 0.2*pi_S + 0.6*pi_R pi_S + pi_R = 1 From the first equation: 0.2*pi_S = 0.4*pi_R => pi_S = 2*pi_R Substituting: 2*pi_R + pi_R = 1 => pi_R = 1/3, pi_S = 2/3 pi = [2/3, 1/3] ~= [0.667, 0.333]
The stationary distribution does not depend on the initial state - only on the structure of P. Whether starting from sun or rain, after many steps the distribution is the same.
Existence and Detailed Balance
Perron-Frobenius theorem: every **irreducible** and **aperiodic** finite chain has a unique stationary distribution, and P^n converges to it.
- Irreducible: every state is reachable from every other state
- Aperiodic: no cyclic traps (gcd of all cycle lengths = 1)
The detailed balance condition (reversibility) is a stronger property that guarantees stationarity:
A chain satisfying detailed balance is called reversible. Metropolis-Hastings specifically constructs reversible chains.
PageRank: pi = stationary distribution of the web graph
A random surfer follows links with probability d=0.85 (damping factor), and jumps to a random page with probability 0.15. PageRank is the stationary distribution of this chain:
Power iteration is the fastest way to find pi for large sparse graphs. Google originally applied it to a web graph of billions of pages.
Why does PageRank need a damping factor d=0.85? What happens without it (d=1)?
Ergodicity and MCMC
Bayesian inference with 10,000 parameters: analytically intractable, direct sampling also impossible - the space is too large. MCMC (Markov Chain Monte Carlo) constructs a Markov chain whose stationary distribution matches the desired posterior. Then it simply walks the chain and collects samples.
Ergodic Theorem
For an ergodic chain (irreducible and aperiodic), the time average along a trajectory equals the space average under the stationary distribution:
This is a powerful result: one does not need to know pi explicitly. Running the chain and averaging a function over the trajectory is sufficient.
Mixing Time
Mixing time is how many steps it takes for a chain to forget its initial state. Formally: the minimum n such that the chain's distribution is epsilon-close to pi in total variation distance.
Mixing time determines how many MCMC samples to discard at the start (burn-in). A poorly designed chain can mix exponentially slowly.
Metropolis-Hastings: Building the Right Chain
Metropolis-Hastings constructs a reversible chain with a given stationary distribution p(x). At each step: propose a move x -> y from proposal q(y|x), accept with probability:
Detailed balance holds by construction: p(x) alpha(x,y) = p(y) alpha(y,x). This guarantees that p is the stationary distribution of the chain.
Stan, PyMC, and NumPyro - popular Bayesian inference frameworks - use MCMC under the hood. PyMC3 builds chains with NUTS (No-U-Turn Sampler), an improved version of Hamiltonian Monte Carlo.
Metropolis-Hastings accepts a proposed move x->y with probability min(1, p(y)/p(x)) for a symmetric proposal. Why does this lead to stationary distribution p(x)?
Hidden Markov Models and the Viterbi Algorithm
Speech, text, proteins - hidden states generate observations. A Hidden Markov Model (HMM) separates a hidden Markov process X from visible observations Y: a POS tag (hidden) generates a word (visible). Google Assistant, Apple Siri, and the early Dragon NaturallySpeaking speech recognition systems all ran on HMMs.
HMM Structure
An HMM is defined by the triple (A, B, pi0):
- A = hidden state transition matrix: A[i,j] = P(Xₙ₊₁=j | Xₙ=i)
- B = emission matrix: B[i,k] = P(Yₙ=k | Xₙ=i)
- pi0 = initial hidden state distribution
| Task | Algorithm | Complexity |
|---|---|---|
| P(Y) - probability of observations | Forward algorithm | O(N^2 T) |
| Best hidden path X* | Viterbi | O(N^2 T) |
| Training (A, B from data) | Baum-Welch (EM) | O(N^2 T) x iter |
The Viterbi Algorithm
Viterbi finds the most probable hidden state path X* = argmax P(X|Y) in O(N*T) using dynamic programming. Core recursion:
States: S1 (sunny), S2 (rainy) Observations: walk (W), shop (S), clean (C) A = [[0.7, 0.3], [0.4, 0.6]] B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]] pi0 = [0.6, 0.4] Y = [W, S, C] t=1 (W): delta[S1]=0.6*0.5=0.30, delta[S2]=0.4*0.1=0.04 t=2 (S): delta[S1]=max(0.30*0.7, 0.04*0.4)*0.4 = 0.084 delta[S2]=max(0.30*0.3, 0.04*0.6)*0.3 = 0.027 t=3 (C): delta[S1]=max(0.084*0.7, 0.027*0.4)*0.1 = 0.00588 delta[S2]=max(0.084*0.3, 0.027*0.6)*0.6 = 0.01512 X* = [S1, S1, S2], probability = 0.01512
HMM to Transformers: the Evolution
Attention in transformers can be viewed as a soft Viterbi: instead of argmax over a hidden path, it computes a weighted sum over all positions. BERT and GPT replaced HMMs in NLP precisely because attention is not limited by the Markov property - it attends to the entire sequence at once.
HMMs are still used in bioinformatics (gene prediction, protein sequence alignment via profile HMMs in HMMER) and in low-resource tasks where there is not enough data to train a transformer.
Why does the Viterbi algorithm use max instead of summation (as in the Forward algorithm)?
Key Ideas
- Markov property: P(Xₙ₊₁|X₀..Xₙ) = P(Xₙ₊₁|Xₙ) - only the present matters for the future
- Transition matrix P: row-stochastic (Pᵢⱼ >= 0, each row sums to 1)
- n-step transitions: (P^n)ᵢⱼ - Chapman-Kolmogorov
- Stationary distribution: piP = pi, unique for irreducible aperiodic chains
- PageRank = stationary distribution of the web graph with damping d=0.85
- MCMC (Metropolis-Hastings): build a chain with the desired stationary distribution for Bayesian inference
- HMM + Viterbi: hidden states + observations, O(N^2*T) search for the optimal path
Related Topics
Markov chains are the foundation of a large part of modern ML and statistics.
- Reinforcement Learning (MDP) — Markov Decision Process - a Markov chain with actions and rewards, the foundation of Q-learning and Policy Gradient
- Poisson Processes — Markov chains in continuous time - arrivals, queueing theory
- Eigenvectors and PageRank — pi is the left eigenvector of P for lambda=1; connection to spectral graph theory
- Bayesian Inference — MCMC makes posterior inference tractable for complex models