Probability Theory

Markov Chains

Google's PageRank is the stationary distribution of a Markov chain on the web graph. The algorithm ranks 130 trillion pages by solving a single matrix equation. MCMC methods used in Bayesian ML to sample from complex posteriors are also Markov chains at their core.

PageRank (Google): stationary distribution on the web graph ranks 130T pages
MCMC in Bayesian ML: Metropolis-Hastings and HMC sample from posterior distributions
Hidden Markov Models: speech recognition in Siri, Google Assistant, and Alexa
Large language models: next-token prediction is a high-order Markov model
Sequential A/B testing: Markov-based stopping rules for early experiment termination
Finance: Vasicek and Hull-White interest rate models as Markov diffusions

Markov Property and Transition Matrix

PageRank - the algorithm Google uses to rank billions of pages - is the stationary distribution of a Markov chain. Every search query in the world implicitly solves the equation **pi = piP**. Andrey Markov invented this structure in 1906 while analyzing the alternation of vowels and consonants in Pushkin's Eugene Onegin. Today it is embedded in every recommendation system, language model, and Reinforcement Learning engine.

The Markov Property

A Markov chain is a sequence of random variables X₀, X₁, X₂, ... with the property that the future depends only on the present, not on the past.

Example: weather. States - sunny (S) and rainy (R). Tomorrow's weather depends only on today's, not on the entire week. Transition matrix:

From \ To	S (tomorrow)	R (tomorrow)
S (today)	0.8	0.2
R (today)	0.4	0.6

In matrix form, P is a row-stochastic matrix: each element Pᵢⱼ >= 0 and each row sums to 1.

n-step Transitions: Chapman-Kolmogorov

The probability of going from state i to state j in n steps is the (i,j) element of the matrix P^n. The Chapman-Kolmogorov equation:

For the weather example: P^2 gives two-day transition probabilities. As n -> inf, the matrix P^n converges - each row becomes the same vector. That vector is the stationary distribution.

P^n converges regardless of the starting row - at n=50 all rows of the matrix are identical. The initial state is forgotten.

Transition matrix P = [[0.7, 0.3], [0.5, 0.5]]. What is the probability of being in state 1 after 2 steps, starting from state 0?

Stationary Distribution

The stationary distribution pi is a probability vector that the chain does not move. Starting from distribution pi, one step later the distribution is still pi. It is the **left eigenvector** of P for eigenvalue 1.

For the weather example: solve pi*P = pi.

pi = [pi_S, pi_R] pi_S = 0.8*pi_S + 0.4*pi_R pi_R = 0.2*pi_S + 0.6*pi_R pi_S + pi_R = 1 From the first equation: 0.2*pi_S = 0.4*pi_R => pi_S = 2*pi_R Substituting: 2*pi_R + pi_R = 1 => pi_R = 1/3, pi_S = 2/3 pi = [2/3, 1/3] ~= [0.667, 0.333]

The stationary distribution does not depend on the initial state - only on the structure of P. Whether starting from sun or rain, after many steps the distribution is the same.

Existence and Detailed Balance

Perron-Frobenius theorem: every **irreducible** and **aperiodic** finite chain has a unique stationary distribution, and P^n converges to it.

Irreducible: every state is reachable from every other state
Aperiodic: no cyclic traps (gcd of all cycle lengths = 1)

The detailed balance condition (reversibility) is a stronger property that guarantees stationarity:

A chain satisfying detailed balance is called reversible. Metropolis-Hastings specifically constructs reversible chains.

PageRank: pi = stationary distribution of the web graph

A random surfer follows links with probability d=0.85 (damping factor), and jumps to a random page with probability 0.15. PageRank is the stationary distribution of this chain:

Power iteration is the fastest way to find pi for large sparse graphs. Google originally applied it to a web graph of billions of pages.

Why does PageRank need a damping factor d=0.85? What happens without it (d=1)?

Ergodicity and MCMC

Bayesian inference with 10,000 parameters: analytically intractable, direct sampling also impossible - the space is too large. MCMC (Markov Chain Monte Carlo) constructs a Markov chain whose stationary distribution matches the desired posterior. Then it simply walks the chain and collects samples.

Ergodic Theorem

For an ergodic chain (irreducible and aperiodic), the time average along a trajectory equals the space average under the stationary distribution:

This is a powerful result: one does not need to know pi explicitly. Running the chain and averaging a function over the trajectory is sufficient.

Mixing Time

Mixing time is how many steps it takes for a chain to forget its initial state. Formally: the minimum n such that the chain's distribution is epsilon-close to pi in total variation distance.

Mixing time determines how many MCMC samples to discard at the start (burn-in). A poorly designed chain can mix exponentially slowly.

Metropolis-Hastings: Building the Right Chain

Metropolis-Hastings constructs a reversible chain with a given stationary distribution p(x). At each step: propose a move x -> y from proposal q(y|x), accept with probability:

Detailed balance holds by construction: p(x) alpha(x,y) = p(y) alpha(y,x). This guarantees that p is the stationary distribution of the chain.

Stan, PyMC, and NumPyro - popular Bayesian inference frameworks - use MCMC under the hood. PyMC3 builds chains with NUTS (No-U-Turn Sampler), an improved version of Hamiltonian Monte Carlo.

Metropolis-Hastings accepts a proposed move x->y with probability min(1, p(y)/p(x)) for a symmetric proposal. Why does this lead to stationary distribution p(x)?

Hidden Markov Models and the Viterbi Algorithm

Speech, text, proteins - hidden states generate observations. A Hidden Markov Model (HMM) separates a hidden Markov process X from visible observations Y: a POS tag (hidden) generates a word (visible). Google Assistant, Apple Siri, and the early Dragon NaturallySpeaking speech recognition systems all ran on HMMs.

HMM Structure

An HMM is defined by the triple (A, B, pi0):

A = hidden state transition matrix: A[i,j] = P(Xₙ₊₁=j | Xₙ=i)
B = emission matrix: B[i,k] = P(Yₙ=k | Xₙ=i)
pi0 = initial hidden state distribution

Task	Algorithm	Complexity
P(Y) - probability of observations	Forward algorithm	O(N^2 T)
Best hidden path X*	Viterbi	O(N^2 T)
Training (A, B from data)	Baum-Welch (EM)	O(N^2 T) x iter

The Viterbi Algorithm

Viterbi finds the most probable hidden state path X* = argmax P(X|Y) in O(N*T) using dynamic programming. Core recursion:

States: S1 (sunny), S2 (rainy) Observations: walk (W), shop (S), clean (C) A = [[0.7, 0.3], [0.4, 0.6]] B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]] pi0 = [0.6, 0.4] Y = [W, S, C] t=1 (W): delta[S1]=0.6*0.5=0.30, delta[S2]=0.4*0.1=0.04 t=2 (S): delta[S1]=max(0.30*0.7, 0.04*0.4)*0.4 = 0.084 delta[S2]=max(0.30*0.3, 0.04*0.6)*0.3 = 0.027 t=3 (C): delta[S1]=max(0.084*0.7, 0.027*0.4)*0.1 = 0.00588 delta[S2]=max(0.084*0.3, 0.027*0.6)*0.6 = 0.01512 X* = [S1, S1, S2], probability = 0.01512

HMM to Transformers: the Evolution

Attention in transformers can be viewed as a soft Viterbi: instead of argmax over a hidden path, it computes a weighted sum over all positions. BERT and GPT replaced HMMs in NLP precisely because attention is not limited by the Markov property - it attends to the entire sequence at once.

HMMs are still used in bioinformatics (gene prediction, protein sequence alignment via profile HMMs in HMMER) and in low-resource tasks where there is not enough data to train a transformer.

Why does the Viterbi algorithm use max instead of summation (as in the Forward algorithm)?

Key Ideas

Markov property: P(Xₙ₊₁|X₀..Xₙ) = P(Xₙ₊₁|Xₙ) - only the present matters for the future
Transition matrix P: row-stochastic (Pᵢⱼ >= 0, each row sums to 1)
n-step transitions: (P^n)ᵢⱼ - Chapman-Kolmogorov
Stationary distribution: piP = pi, unique for irreducible aperiodic chains
PageRank = stationary distribution of the web graph with damping d=0.85
MCMC (Metropolis-Hastings): build a chain with the desired stationary distribution for Bayesian inference
HMM + Viterbi: hidden states + observations, O(N^2*T) search for the optimal path

Связанные уроки

Probability Theory

Markov Chains

PageRank (Google): stationary distribution on the web graph ranks 130T pages
MCMC in Bayesian ML: Metropolis-Hastings and HMC sample from posterior distributions
Hidden Markov Models: speech recognition in Siri, Google Assistant, and Alexa
Large language models: next-token prediction is a high-order Markov model
Sequential A/B testing: Markov-based stopping rules for early experiment termination
Finance: Vasicek and Hull-White interest rate models as Markov diffusions

Markov Property and Transition Matrix

The Markov Property

A Markov chain is a sequence of random variables X₀, X₁, X₂, ... with the property that the future depends only on the present, not on the past.

Example: weather. States - sunny (S) and rainy (R). Tomorrow's weather depends only on today's, not on the entire week. Transition matrix:

From \ To	S (tomorrow)	R (tomorrow)
S (today)	0.8	0.2
R (today)	0.4	0.6

In matrix form, P is a row-stochastic matrix: each element Pᵢⱼ >= 0 and each row sums to 1.

n-step Transitions: Chapman-Kolmogorov

The probability of going from state i to state j in n steps is the (i,j) element of the matrix P^n. The Chapman-Kolmogorov equation:

For the weather example: P^2 gives two-day transition probabilities. As n -> inf, the matrix P^n converges - each row becomes the same vector. That vector is the stationary distribution.

P^n converges regardless of the starting row - at n=50 all rows of the matrix are identical. The initial state is forgotten.

Transition matrix P = [[0.7, 0.3], [0.5, 0.5]]. What is the probability of being in state 1 after 2 steps, starting from state 0?

Stationary Distribution

For the weather example: solve pi*P = pi.

The stationary distribution does not depend on the initial state - only on the structure of P. Whether starting from sun or rain, after many steps the distribution is the same.

Existence and Detailed Balance

Perron-Frobenius theorem: every **irreducible** and **aperiodic** finite chain has a unique stationary distribution, and P^n converges to it.

Irreducible: every state is reachable from every other state
Aperiodic: no cyclic traps (gcd of all cycle lengths = 1)

The detailed balance condition (reversibility) is a stronger property that guarantees stationarity:

A chain satisfying detailed balance is called reversible. Metropolis-Hastings specifically constructs reversible chains.

PageRank: pi = stationary distribution of the web graph

A random surfer follows links with probability d=0.85 (damping factor), and jumps to a random page with probability 0.15. PageRank is the stationary distribution of this chain:

Power iteration is the fastest way to find pi for large sparse graphs. Google originally applied it to a web graph of billions of pages.

Why does PageRank need a damping factor d=0.85? What happens without it (d=1)?

Ergodicity and MCMC

Ergodic Theorem

For an ergodic chain (irreducible and aperiodic), the time average along a trajectory equals the space average under the stationary distribution:

This is a powerful result: one does not need to know pi explicitly. Running the chain and averaging a function over the trajectory is sufficient.

Mixing Time

Mixing time is how many steps it takes for a chain to forget its initial state. Formally: the minimum n such that the chain's distribution is epsilon-close to pi in total variation distance.

Mixing time determines how many MCMC samples to discard at the start (burn-in). A poorly designed chain can mix exponentially slowly.

Metropolis-Hastings: Building the Right Chain

Metropolis-Hastings constructs a reversible chain with a given stationary distribution p(x). At each step: propose a move x -> y from proposal q(y|x), accept with probability:

Detailed balance holds by construction: p(x) alpha(x,y) = p(y) alpha(y,x). This guarantees that p is the stationary distribution of the chain.

Stan, PyMC, and NumPyro - popular Bayesian inference frameworks - use MCMC under the hood. PyMC3 builds chains with NUTS (No-U-Turn Sampler), an improved version of Hamiltonian Monte Carlo.

Metropolis-Hastings accepts a proposed move x->y with probability min(1, p(y)/p(x)) for a symmetric proposal. Why does this lead to stationary distribution p(x)?

Hidden Markov Models and the Viterbi Algorithm

HMM Structure

An HMM is defined by the triple (A, B, pi0):

A = hidden state transition matrix: A[i,j] = P(Xₙ₊₁=j | Xₙ=i)
B = emission matrix: B[i,k] = P(Yₙ=k | Xₙ=i)
pi0 = initial hidden state distribution

Task	Algorithm	Complexity
P(Y) - probability of observations	Forward algorithm	O(N^2 T)
Best hidden path X*	Viterbi	O(N^2 T)
Training (A, B from data)	Baum-Welch (EM)	O(N^2 T) x iter

The Viterbi Algorithm

Viterbi finds the most probable hidden state path X* = argmax P(X|Y) in O(N*T) using dynamic programming. Core recursion:

HMM to Transformers: the Evolution

HMMs are still used in bioinformatics (gene prediction, protein sequence alignment via profile HMMs in HMMER) and in low-resource tasks where there is not enough data to train a transformer.

Why does the Viterbi algorithm use max instead of summation (as in the Forward algorithm)?

Key Ideas

Markov property: P(Xₙ₊₁|X₀..Xₙ) = P(Xₙ₊₁|Xₙ) - only the present matters for the future
Transition matrix P: row-stochastic (Pᵢⱼ >= 0, each row sums to 1)
n-step transitions: (P^n)ᵢⱼ - Chapman-Kolmogorov
Stationary distribution: piP = pi, unique for irreducible aperiodic chains
PageRank = stationary distribution of the web graph with damping d=0.85
MCMC (Metropolis-Hastings): build a chain with the desired stationary distribution for Bayesian inference
HMM + Viterbi: hidden states + observations, O(N^2*T) search for the optimal path

Markov Chains

Markov Property and Transition Matrix

The Markov Property

n-step Transitions: Chapman-Kolmogorov

Stationary Distribution

Existence and Detailed Balance

PageRank: pi = stationary distribution of the web graph

Ergodicity and MCMC

Ergodic Theorem

Mixing Time

Metropolis-Hastings: Building the Right Chain

Hidden Markov Models and the Viterbi Algorithm

HMM Structure

The Viterbi Algorithm

HMM to Transformers: the Evolution

Key Ideas

Related Topics

Связанные уроки

Markov Chains

Markov Property and Transition Matrix

The Markov Property

n-step Transitions: Chapman-Kolmogorov

Stationary Distribution

Existence and Detailed Balance

PageRank: pi = stationary distribution of the web graph

Ergodicity and MCMC

Ergodic Theorem

Mixing Time

Metropolis-Hastings: Building the Right Chain

Hidden Markov Models and the Viterbi Algorithm

HMM Structure

The Viterbi Algorithm

HMM to Transformers: the Evolution

Key Ideas

Related Topics

Связанные уроки