Stochastic Processes
Continuous-Time Markov Chains
AWS Lambda cold start: a container wakes up, handles one request, then dies. Mean lifetime follows a Poisson process. The request queue is an M/M/c birth-death process. AWS engineers solve $\pi Q = 0$ for M/M/c - find the stationary queue-length distribution - and read off optimal provisioned concurrency. Kolmogorov 1931 -> AWS 2024. The same mathematics runs Folding@home (protein conformation simulations) and COVID-19 epidemiological models.
- **ML infrastructure:** M/M/c queues for GPU cluster autoscaling - Microsoft AzureML, Google Vertex AI; Gillespie algorithm for biochemical simulations in AlphaFold protein dynamics
- **Reinforcement learning:** Markov Decision Process in continuous time - CTMC with control; Hamilton-Jacobi-Bellman equation is the continuous-time analogue of Kolmogorov equations; policy gradient in continuous-time RL
- **Reliability engineering:** component failure and repair modeled as CTMC; Erlang-B/C formulas in call-routing algorithms and SRE p99-latency optimization
Предварительные знания
CTMC and the Q-Matrix
AWS Lambda cold start. A container wakes up, handles one request, then dies. The mean lifetime follows a Poisson process. The queue of incoming requests is an M/M/c birth-death process. AWS engineers solve a single equation - $\pi Q = 0$ - to find optimal provisioned concurrency and shave p99 latency. The same mathematics drives Folding@home when simulating protein conformations and epidemiological models of COVID-19. That is continuous-time - CTMC.
In a DTMC, transitions happen at steps 0, 1, 2, ... In a **CTMC** (Continuous-Time Markov Chain) the system transitions at any moment of continuous time $t \geq 0$. Instead of a probability matrix $P$, one uses a **Q-matrix** (infinitesimal generator) that encodes transition rates. Physical meaning: $q_{ij} \cdot dt$ is the probability of jumping from state $i$ to state $j$ over a small interval $dt$.
**Q-matrix:** $q_{ij} \geq 0$ for $i \neq j$ - transition rate from $i$ to $j$. Diagonal: $q_{ii} = -\sum_{j \neq i} q_{ij}$, so every row sums to zero. Mean holding time in state $i$ equals $1/|q_{ii}|$.
The Gillespie algorithm (1977) - exact stochastic simulation of biochemical reactions - is precisely a CTMC simulator. AlphaFold uses it for protein dynamics: each protein conformation is a state, transition rates form the Q-matrix. Three decades later the same mathematics drives computational biology and drug discovery at scale.
Why are the diagonal elements of the Q-matrix negative?
Birth-Death Processes
A **birth-death process** is a CTMC where transitions are only possible between adjacent states: $n \to n+1$ (birth at rate $\lambda_n$) and $n \to n-1$ (death at rate $\mu_n$). A modest construction - yet the entire queueing theory powering ML infrastructure rests on it. Microsoft AzureML and Google Vertex AI use M/M/c queues for GPU cluster autoscaling: solve for the minimum number of GPUs $c$ to keep warm under peak training load.
The textbook example is the **M/M/1 queue**: jobs arrive as a Poisson process at rate $\lambda$ and are served at rate $\mu$. State $n$ = number of jobs in the system. Utilization $\rho = \lambda/\mu < 1$ is the stability condition. As $\rho \to 1$, the mean queue length $L = \rho/(1-\rho)$ explodes nonlinearly.
| Queue model | lambda_n | mu_n | Application |
|---|---|---|---|
| M/M/1 | lambda (const) | mu (const) | Single GPU / single server |
| M/M/c | lambda (const) | min(n,c)*mu | GPU cluster (AzureML, Vertex AI) |
| M/M/1/K | lambda (n<K), 0 (n=K) | mu (const) | Bounded request buffer |
| M/M/inf | lambda (const) | n*mu | Serverless (each request gets own worker) |
**Stability condition for M/M/1:** $\rho = \lambda/\mu < 1$. When $\rho \geq 1$ the queue grows without bound - no stationary distribution exists. M/M/c is stable when $\rho = \lambda/(c\mu) < 1$ - exactly the calculation behind choosing minimum GPU count $c$ for peak load.
In an M/M/1 queue with $\lambda=4$ and $\mu=5$, the utilization and mean number of jobs are:
Kolmogorov Equations and Stationary Distribution
1931. Kolmogorov derives two equations describing how CTMC probabilities evolve. Ninety-three years later, AWS Lambda optimizers solve the same thing - and call it cost optimization. **Forward equations:** $\frac{dP(t)}{dt} = P(t)Q$ - how probabilities propagate forward in time. **Backward equations:** $\frac{dP(t)}{dt} = QP(t)$ - the backward view. In steady state $\frac{dP}{dt} = 0$, giving $\pi Q = 0$.
**Stationary equation:** $\pi Q = 0$, $\sum_i \pi_i = 1$. Physical meaning: for each state $j$, total flow in equals total flow out. Markov Decision Processes in continuous-time RL are CTMCs with control; the Hamilton-Jacobi-Bellman equation is the continuous-time analogue of $\pi Q = 0$.
For birth-death processes $\pi Q = 0$ reduces to **detailed balance**: $\pi_n \lambda_n = \pi_{n+1} \mu_{n+1}$. Flow from $n$ to $n+1$ equals flow back from $n+1$ to $n$. This recurrence has an analytic closed form - no matrix algebra needed. For M/M/1: $\pi_n = (1-\rho)\rho^n$ - a geometric distribution.
| M/M/1 formula | Expression | Meaning |
|---|---|---|
| Utilization | rho = lambda/mu | Fraction of time server is busy |
| P(n jobs) | pi_n = (1-rho)*rho^n | Geometric distribution |
| Mean in system | L = rho/(1-rho) | Little's law: L = lambda*W |
| Mean time | W = 1/(mu-lambda) | Time from arrival to departure |
Erlang and Kolmogorov: from telephone to AWS
1909. Danish engineer Agner Erlang builds the first CTMC model for the Copenhagen Telephone Company. Question: how many operators are needed to keep fewer than 1% of calls waiting more than one minute? The Erlang-B and Erlang-C formulas are still embedded in every call-routing algorithm. Kolmogorov provides the rigorous mathematical foundation in 1931. The unit of telephone traffic "erlang" - and every modern SRE p99-latency metric - are descendants of one formula.
DTMC and CTMC are the same thing, just in different time domains
CTMC is not a DTMC with faster steps. It requires exponentially distributed holding times (memoryless property). The Q-matrix encodes rates, not probabilities. Without the memoryless property there is no Markov property in continuous time.
The exponential distribution is the only continuous distribution with the memoryless property: $\mathbb{P}(T > s+t \mid T > s) = \mathbb{P}(T > t)$. If holding times are not exponential (e.g., deterministic service times), how long one has already waited carries information about the future - violating the Markov property.
What does $\pi Q = 0$ mean for a CTMC?
Key Ideas
- **Q-matrix** (generator): $q_{ij} \geq 0$ for $i \neq j$, $q_{ii} = -\sum_{j \neq i} q_{ij}$, rows sum to zero; $P(t) = e^{Qt}$; mean holding time in state $i$ equals $1/|q_{ii}|$
- **Birth-death process**: transitions only between adjacent states; M/M/c models GPU-cluster autoscaling; stability condition $\rho = \lambda/(c\mu) < 1$
- **Kolmogorov equations**: forward $dP/dt = PQ$, backward $dP/dt = QP$; stationary $\pi Q = 0$; detailed balance $\pi_n \lambda_n = \pi_{n+1} \mu_{n+1}$ for birth-death
- **Memoryless property**: exponential distribution is the only continuous distribution satisfying $\mathbb{P}(T>s+t|T>s)=\mathbb{P}(T>t)$; this is what makes the Markov property possible in continuous time
Related Topics
CTMC bridges theory and infrastructure:
- Discrete-Time Markov Chains — CTMC generalizes DTMC to continuous time; Q-matrix of rates replaces P-matrix of probabilities
- Stochastic Processes: Definitions — CTMC is the canonical example of a stationary ergodic process with the Markov property
- MCMC and Sampling — Metropolis-Hastings and HMC construct ergodic CTMCs with the desired posterior as stationary distribution
Вопросы для размышления
- An AWS Lambda optimizer solves $\pi Q = 0$ for M/M/c with five states (0..4 warm containers). Which Q-matrix parameter governs cold-start speed, and which governs throughput? How does shifting $\mu$ change the optimal container count?
- Real GPU job durations are not exponential - they may be deterministic (fixed batch size) or log-normal. How does this affect M/M/1 accuracy? What is an M/G/1 queue and why is it harder to analyze?
- At $\rho = 0.99$ the mean M/M/1 queue length is 99. The system is stable, but p99 latency is catastrophic. What two levers does an SRE engineer have - and how does detailed balance explain why adding a second server (M/M/2) reduces the queue nonlinearly?
Связанные уроки
- sp-02 — CTMC extends DTMC - Q-matrix replaces P-matrix
- sp-01 — Markov property and stochastic process foundations
- sp-04 — Martingales and stopping times - next layer of stochastic analysis
- sp-12 — MCMC builds on ergodic chains with target stationary distribution
- prob-18-poisson — Poisson process is a degenerate CTMC with a single state
- prob-17 — Parallel view of Markov chains from the probability course