Stochastic Processes
Stochastic Processes in ML and Systems
SGD that trains every neural network is a stochastic process. Stable Diffusion and DALL-E are SDEs. The server infrastructure under ML models is described by queueing theory. Stochastic processes are not abstract mathematics-they are working tools for the ML engineer.
- **Neural network training**-SGD as Langevin dynamics; Robbins-Monro conditions for choosing the learning rate schedule
- **Generative models (DDPM, Stable Diffusion)**-forward and reverse SDEs; score matching as an approximation problem for the score function
- **ML infrastructure**-inference batching via queueing theory; tail latency via M/G/1 with heavy tails
Предварительные знания
SGD Convergence as a Stochastic Process
Stochastic gradient descent is a random process: each step depends on a random mini-batch. The trajectory θ_0, θ_1, θ_2, ... in parameter space is a Markov chain (the next state depends only on the current one). Analyzing SGD convergence through martingale theory and stochastic processes explains why decaying learning rates are both necessary and sufficient for convergence.
**SGD as a stochastic process:** θ_{t+1} = θ_t - η_t · ∇̃L(θ_t) where ∇̃L(θ_t) = ∇L(θ_t) + ξ_t is the stochastic gradient (true + noise). **Robbins-Monro conditions for convergence:** ∑ η_t = ∞ (steps large enough to reach the optimum) ∑ η_t² < ∞ (steps small enough to stop jumping forever) Typical choices: η_t = η₀ / t (polynomial decay) or η_t = η₀ / √t.
**Connection to Brownian motion:** with small η_t and many steps, the SGD trajectory approximates the **Langevin SDE**: dθ = -∇L(θ)dt + √(2T) dB, where T is the "temperature" (noise scale). This allows analyzing SGD with SDE tools and gives theoretical grounding for SGD as MCMC (Langevin Monte Carlo samples from the posterior over network weights).
| SGD regime | η_t condition | Behavior near optimum | Use case |
|---|---|---|---|
| Decaying η | ∑η_t=∞, ∑η_t²<∞ | Converges a.s. to optimum | Classical optimization |
| Constant η | η_t = const | Fluctuates, E[‖θ-θ*‖²] ~ η·σ² | Online learning, SGLD |
| Cosine schedule | η_t = cosine decay | "Cooling" effect | Training transformers |
| Warmup + decay | η_t = warmup·schedule | Stabilizes early training | BERT, GPT training |
Robbins and Monro, 1951
The Robbins-Monro conditions were formulated in 1951 for stochastic approximation-finding the zero of a function through noisy observations. The same conditions turned out to govern SGD convergence in machine learning. In 2011 Welling and Teh proposed SGLD (Stochastic Gradient Langevin Dynamics)-adding Gaussian noise at the right scale turns SGD into an MCMC sampler.
SGD is just GD with noise, and the theory is the same for both
SGD is a Markov chain in parameter space. Its convergence requires specific conditions (Robbins-Monro), and its behavior near the optimum is described by the Langevin SDE. Deterministic GD theory does not directly apply
In GD the gradient is deterministic, so a constant rate suffices. In SGD every step is random: the total variance must be finite (∑η_t² < ∞), otherwise θ_t does not converge to a point but keeps fluctuating around it
SGD with η_t = 0.01/t (decaying) trains slower than with η = 0.01 (constant). Why is the decaying rate still preferred for theoretical convergence guarantees?
Random Graphs and Information Spreading
Social networks, transport routes, neural connections-all are randomly constructed graphs. The **Erdős-Rényi model G(n, p)**-n vertices, each edge present independently with probability p-is the simplest random graph model. It allows analyzing the spread of information, viruses, and rumors. The key question: at what p does the graph become connected?
**Erdős-Rényi model G(n, p):** - Number of edges: Bin(n(n-1)/2, p), mean p·n(n-1)/2 - Expected degree: (n-1)p **Threshold phenomena:** - p < 1/n: only small components; largest is ~ log n - p = 1/n: threshold for a giant component to appear - p > (1+ε)/n: giant component covers a fraction Θ(1) of vertices - p > log(n)/n: graph is connected a.s. **Spreading on a graph** is a Markov process on vertex states.
**Graph threshold phenomena** are phase transitions: a small change in p qualitatively changes the graph structure. Analogous thresholds appear in distributed systems: **epidemic gossip protocols** (how Cassandra propagates schema changes), **consensus algorithms** (quorum = graph connectivity under failures), **percolation** (server cluster connectivity under partial failures).
| Threshold p | Graph structure | Systems analogy |
|---|---|---|
| p < 1/n | Only small components | Network split into isolated clusters |
| p ≈ 1/n | Giant component emerges | Threshold connectivity in gossip |
| p > log(n)/n | Full connectivity | Broadcast reaches all nodes |
| p = c/n (c>1) | Giant component covers ~βn vertices | β depends nonlinearly on c |
In G(n, p) with p > 0, all vertices are connected with positive probability
For p < log(n)/n, with high probability isolated vertices exist (degree 0). The connectivity threshold is exactly p = log(n)/n: only above it is the graph connected a.s.
An isolated vertex has probability (1-p)^{n-1} ≈ e^{-pn}. At p = log(n)/n: P(isolated) ≈ 1/n. Expected number of isolated vertices ≈ 1. This is the critical point: below the threshold isolated vertices exist; above it they don't
In G(1000, p), at what p does the graph become connected with high probability?
Queues in Distributed Systems
Real systems are more complex than M/M/1: a GPU cluster processes batch LLM requests, multiple replicas balance load, and inference time varies. Applying queueing theory to ML infrastructure is a direct practical payoff of stochastic processes for the ML engineer.
**Key models for ML systems:** **M/M/c (c servers):** Mean wait = (C(c, ρ) / (c·μ·(1-ρ/c))) + 1/μ where C(c, ρ) is the Erlang-C formula, ρ = λ/(cμ) **Batching:** Processing B requests together takes time τ(B) that grows sub-proportionally to B (GPU efficiency). Optimal batch: minimize E[wait] + E[processing] over B. **Incast problem:** N microservices responding simultaneously to one request-a Poisson flood. P(at least one > T) → 1 as N → ∞.
**Tail latency** is especially important in ML systems: P99 latency can be 10x the median due to garbage collection, container cold starts, or inference variability. The M/G/1 Pollaczek-Khinchine formula shows that reducing CV (coefficient of variation) of processing time is the main lever. Request hedging (send to multiple replicas, take the first response) reduces the tail CV.
| System pattern | Queue model | Key metric | Optimization |
|---|---|---|---|
| LLM inference (single) | M/G/1 | W = E[S]/(1-ρ) · (1+CV²)/2 | Reduce CV via batching |
| Load balancer + N replicas | M/M/N | Erlang-C formula | Horizontal scaling |
| Batch inference | M[X]/G/1 (bulk) | Wait vs throughput tradeoff | Optimal batch size |
| Streaming (autoregressive) | Chain of M/M/1 | Total chain latency | Jackson networks |
Linear scaling: N servers = N-fold improvement in latency
The improvement from adding servers is nonlinear. At low load (ρ≪1) a second server barely helps. At high load (ρ→1), a second server reduces waiting exponentially. This is the Erlang-C formula
Queue length depends nonlinearly on load: L = ρ/(1-ρ). At ρ=0.9, L=9; at ρ=0.45 (two servers, same λ), L=0.82. That's an 11-fold reduction from doubling the number of servers-exactly the nonlinearity of queueing theory
LLM inference service: λ=10 req/s, μ=12 req/s (exponential). Adding a second replica (M/M/2)-how does mean queue waiting time change?
Diffusion Generative Models
Stable Diffusion, DALL-E 3, Sora-behind all these models is one idea: gradually add Gaussian noise to an image (forward process) and train a neural network to reverse it (reverse process). The forward process is an Ito SDE; the reverse is also an SDE, connected to the forward via the score function ∇ₓ log p(x).
**DDPM (Denoising Diffusion Probabilistic Models):** Forward process (noise addition): dxₜ = -½β(t)xₜ dt + √β(t) dBₜ This is an Ornstein-Uhlenbeck SDE: the initial image x₀ gradually becomes N(0,I). Reverse process (image generation): dxₜ = [-½β(t)xₜ - β(t)·∇ₓ log p_t(xₜ)] dt + √β(t) dB̃ₜ where ∇ₓ log p_t(x) is the score function: a neural network s_θ(x, t) is trained to approximate it.
**Score matching:** the neural network s_θ(x, t) approximates ∇ₓ log p_t(x)-the gradient of the log density with respect to x. This allows the reverse SDE to be written without knowing the normalizing constant. Equivalently to DDPM: predicting the added noise ε is identical to predicting the score (up to a scale factor). This is an elegant connection between stochastic process theory and modern generative AI.
| Model | Forward SDE | Reverse SDE | Sampler |
|---|---|---|---|
| DDPM | dX = -X/2 dt + dB (OU) | Reverse OU with score | DDPM (1000 steps) |
| DDIM | Same forward | Deterministic ODE | DDIM (50 steps) |
| Score SDE (VP) | dX = -β(t)/2 X dt + √β(t) dB | Reverse with score | Euler-Maruyama |
| Flow Matching | ODE: dX = v_θ(X,t)dt | Direct v_θ | Euler ODE (10-20 steps) |
DDPM trains a neural network to "draw images"
DDPM trains a network to estimate the score function ∇ₓ log p_t(x)-the gradient of the log density-or equivalently to predict the added noise ε. This allows the forward SDE to be reversed and samples drawn from p(x₀)
The network does not store images or "know" what they look like. It learns on pairs (x_t, ε) to predict the noise added at step t. Via the score function, this amounts to learning the geometry of the data manifold
The DDPM forward process is an Ornstein-Uhlenbeck SDE. What happens to the original image x₀ as t → ∞?
Key Ideas
- **SGD**-Markov chain in parameter space; convergence requires Robbins-Monro conditions; connection to Langevin SDE
- **Random graphs** G(n,p)-threshold phenomena at p = 1/n (giant component) and log(n)/n (connectivity); applications in gossip protocols
- **Queues in ML systems**-M/M/c for replicas, M/G/1 for inference with variable CV; optimal batching via latency-throughput tradeoff
- **Diffusion models**-forward OU-SDE erases information; reverse SDE with score function generates; DDPM = noise prediction ≡ score prediction
Related Topics
Stochastic in ML unites the entire course in practical applications:
- MCMC and Sampling — SGD with noise (SGLD) is MCMC; diffusion models generate via the reverse SDE analogously to Langevin MCMC
- Stochastic Differential Equations — DDPM forward/reverse processes are concrete SDEs; numerical solution via Euler-Maruyama = DDPM inference
- Queueing Theory — M/G/1, Jackson networks-the foundation for analyzing latency and throughput of ML services
Вопросы для размышления
- SGLD adds Gaussian noise to every SGD step. Under what conditions does this become a correct MCMC sampler for the posterior distribution over network weights?
- In DDPM the score function ∇ₓ log p_t(x) is needed for the reverse process. What does this vector represent physically for an intermediate noisy image x_t?
- P99 latency of an inference service is 10x the median. Using the P-K formula for M/G/1: what should change in the service design to reduce P99?