Stochastic Processes

Impulse Control and Continuous-State MDPs

Lesson objectives

Derive the HJB equation from Bellman's optimality principle
Solve the LQR problem through the algebraic Riccati equation
Apply Pontryagin's maximum principle and BSDE for nonlinear problems
Connect stochastic control to Deep RL algorithms

Prerequisites

Stochastic differential equations
Levy processes
Ito's formula

SpaceX lands a rocket vertically. The controller is not a neural network. It is LQR plus the Riccati equation, computed offline. A 6x6 matrix multiplied by the state vector - that is the entire onboard controller.

SpaceX Falcon 9: LQR control for first-stage landing
DeepMind: HJB as the mathematical foundation of Deep Q-Network
RLHF: Bellman equation for fine-tuning language models
Tesla FSD: iLQR planner for nonlinear vehicle dynamics

Pontryagin, Bellman, and the Space Race

Richard Bellman developed dynamic programming at the RAND Corporation in the early 1950s and published his monograph Dynamic Programming in 1957. Lev Pontryagin and his team (Boltyansky, Gamkrelidze, Mishchenko) obtained the maximum principle in Moscow in 1956-57. The two approaches, HJB and Pontryagin, were developed in parallel at the height of the Cold War. Rudolf Kalman solved the LQR problem in 1960 and showed that its solution is governed by a matrix Riccati equation (the equation itself is named after the 18th-century mathematician Jacopo Riccati). The connection between deep learning and optimal control was made explicit by Weinan E in 2017.

The Hamilton-Jacobi-Bellman Equation

This lesson extends classical stochastic control: impulse controls with discrete actions, optimal stopping via quasi-variational inequalities, and MDPs with continuous state spaces.

DeepMind, 2015. DQN is evaluated on 49 Atari games and reaches at least 75% of a professional human tester's score on 29 of them. The mathematics behind this is the Bellman equation. Not a heuristic, not a trick, but a rigorous theorem: in continuous time the value function solves a PDE (the HJB equation), and the optimal policy is read off from it. Stochastic control is reinforcement learning with a proof.

HJB in RLHF

Reinforcement Learning from Human Feedback as stochastic control

RLHF (Ouyang et al., 2022): fine-tuning a language model with human feedback. State: conversation context. Control: next token. Noise: randomness in sampling. Value function V(context): expected reward. The discrete-time Bellman equation is a direct analog of HJB. PPO is a policy-gradient method: it improves the policy through a clipped surrogate objective, while its critic is fit to value targets that come from the Bellman recursion.
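To make the discrete-time Bellman equation concrete, here is a minimal sketch of value iteration on a toy problem. The five-state chain, its rewards, and the discount factor are illustrative assumptions and have nothing to do with an actual RLHF pipeline; the point is only the fixed-point update V(s) = max_a [r(s, a) + gamma V(s')].

```python
# Toy value iteration: V(s) = max_a [ r(s, a) + gamma * V(next(s, a)) ].
# The chain MDP below is a made-up example for illustration only.

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9           # discount factor (assumed)


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == N_STATES - 1:
        return state, 0.0  # the goal is absorbing and pays nothing further
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def backup(state, action, values):
    """One-step Bellman backup for a given state-action pair."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def value_iteration(tol=1e-10, max_iter=1000):
    """Iterate the Bellman operator until the sup-norm change is below tol."""
    values = [0.0] * N_STATES
    for _ in range(max_iter):
        new_values = [max(backup(s, a, values) for a in ACTIONS)
                      for s in range(N_STATES)]
        residual = max(abs(n - v) for n, v in zip(new_values, values))
        values = new_values
        if residual < tol:
            break
    return values


def greedy_policy(values):
    """Pick, in each state, the action with the largest backup."""
    return [max(ACTIONS, key=lambda a: backup(s, a, values))
            for s in range(N_STATES)]


values = value_iteration()
policy = greedy_policy(values)
print(values)
print(policy)
```

In this toy chain the value decays geometrically with the distance to the goal, and the greedy policy moves right from every non-goal state. HJB is the continuous-time limit of the same backup, with the sum over next states replaced by drift and diffusion terms.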

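The HJB equation is a PDE in d+1 variables (d state dimensions plus time). A grid-based solver needs a number of nodes that grows exponentially in d, so beyond roughly d > 5 such solvers become impractical: this is the curse of dimensionality. Numerical alternatives: the deep Ritz method or deep BSDE, which use neural networks as approximators of V*.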
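A quick back-of-the-envelope calculation shows how fast a grid blows up. The sketch assumes a tensor-product grid with the same number of points on every axis (time included) and 8 bytes per stored value; both are illustrative choices, not properties of any particular solver.

```python
def grid_nodes(points_per_axis, state_dim):
    """Nodes on a tensor-product grid over time plus state_dim state axes."""
    return points_per_axis ** (state_dim + 1)


def grid_memory_bytes(points_per_axis, state_dim, bytes_per_value=8):
    """Memory needed to store one float64 value of V per grid node."""
    return grid_nodes(points_per_axis, state_dim) * bytes_per_value


for d in (1, 2, 5, 10):
    nodes = grid_nodes(100, d)
    terabytes = grid_memory_bytes(100, d) / 1e12
    print(f"d={d:2d}  nodes={nodes:.1e}  memory={terabytes:.1e} TB")
```

With 100 points per axis, d = 5 already means 10^12 nodes, which is about 8 TB just to store V once.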

Why does V(t,x) satisfy a PDE rather than an ODE?

V depends on two arguments: time t and d-dimensional state x. Bellman's dynamic programming principle yields an equation involving both partial_t V and grad_x V - making it a PDE. Stochasticity adds the diffusion term.
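The step from the dynamic programming principle to the PDE can be sketched as follows. The setup is an assumption made for this sketch: controlled dynamics dX_s = f(X_s, u_s) ds + sigma(X_s, u_s) dW_s, running cost L, terminal cost g, and a value function smooth enough for Ito's formula to apply.

```latex
% Dynamic programming principle over a short interval [t, t+h]
V(t,x) = \inf_{u}\, \mathbb{E}\!\left[ \int_t^{t+h} L(X_s,u_s)\,ds + V(t+h, X_{t+h}) \,\middle|\, X_t = x \right]

% Ito's formula applied to V(s, X_s)
V(t+h, X_{t+h}) = V(t,x)
  + \int_t^{t+h} \Big( \partial_t V + f^\top \nabla_x V
  + \tfrac12 \operatorname{tr}\!\big(\sigma\sigma^\top \nabla_x^2 V\big) \Big)\,ds
  + \int_t^{t+h} (\nabla_x V)^\top \sigma \, dW_s

% Take expectations (the dW integral vanishes), divide by h, let h -> 0
-\partial_t V(t,x) = \min_{u} \Big[ L(x,u) + f(x,u)^\top \nabla_x V(t,x)
  + \tfrac12 \operatorname{tr}\!\big(\sigma\sigma^\top(x,u)\, \nabla_x^2 V(t,x)\big) \Big],
\qquad V(T,x) = g(x)
```

The time derivative comes from the shrinking horizon, the gradient term from the drift, and the second-order term from the quadratic variation of the noise. With sigma = 0 the Hessian term disappears, but the equation is still a PDE because both the time derivative and the gradient in x remain.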

The Linear-Quadratic Regulator and Riccati Equation

SpaceX landed the Falcon 9 first stage in 2015. The controller: LQR with state feedback. State: rocket position and velocity. Control: engine thrust. The Riccati equation is solved once offline. During landing: pure matrix multiplication.

The Riccati equation is solved once - then the control is a matrix-vector multiplication. This is fundamentally faster than deep RL, where a neural network performs a forward pass at every step. For safety-critical systems (rockets, autopilots), LQR is the standard.
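The split between offline and online work can be illustrated on a scalar toy system. This is a minimal sketch under stated assumptions, not any real flight controller: dynamics dx/dt = a x + b u, cost equal to the integral of q x^2 + r u^2, and made-up coefficients. In one dimension the algebraic Riccati equation 2 a p - (b^2 / r) p^2 + q = 0 is a quadratic with a closed-form positive root.

```python
import math


def solve_scalar_are(a, b, q, r):
    """Positive (stabilizing) root of 2*a*p - (b**2 / r) * p**2 + q = 0."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)


def lqr_gain(a, b, q, r):
    """Offline step: solve the Riccati equation, return (gain K, solution p)."""
    p = solve_scalar_are(a, b, q, r)
    return b * p / r, p


# Hypothetical coefficients; a > 0 means the uncontrolled system is unstable.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0
K, P = lqr_gain(A, B, Q, R)  # computed once, before "flight"


def simulate(x0, gain, dt=0.001, steps=5000):
    """Online step: explicit Euler simulation with feedback u = -gain * x."""
    x = x0
    for _ in range(steps):
        u = -gain * x              # the whole controller: one multiplication
        x += dt * (A * x + B * u)
    return x


x_final = simulate(1.0, K)
print(f"K = {K:.4f}, closed-loop pole = {A - B * K:.4f}, x(5) = {x_final:.2e}")
```

The closed-loop coefficient a - b K is negative, so the state decays even though the open-loop system is unstable. In the matrix case the structure is identical: the Riccati solve happens offline and the online law is u = -K x.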

Method	Optimality	Computation	Applications
LQR	Exact (linear system)	ARE solved once	SpaceX Falcon 9, Tesla autopilot
MPC	Approximate (horizon N)	QP at each step	Industrial robots, HVAC
Deep RL (PPO)	Empirical	Neural forward pass	Atari, robotics, LLM (RLHF)
iLQR	Locally optimal	Iterative linearization	Nonlinear robotics (MuJoCo)

iLQR (iterative LQR) linearizes the nonlinear system at each iteration and applies LQR. Used in Google DeepMind's locomotion planners for MuJoCo-based environments.

In LQR the optimal control u* = -K x is linear in the state x. Where does this linearity come from?

V*(x) = x^T P x under linear dynamics and quadratic cost. Substituting into the HJB optimality condition: u* = arg min_u [u^T R u + (2Px)^T B u] = -R^{-1} B^T P x.
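Substituting the minimizer back into the HJB equation shows where the algebraic Riccati equation comes from. The sketch below assumes the deterministic infinite-horizon problem with dynamics dx/dt = A x + B u and running cost x^T Q x + u^T R u; the symbols A and Q are introduced here for illustration, and P is taken to be symmetric.

```latex
% Stationary HJB with V^*(x) = x^\top P x, so \nabla V^* = 2 P x
0 = \min_u \Big[ x^\top Q x + u^\top R u + (2Px)^\top (A x + B u) \Big]

% First-order condition in u
2 R u + 2 B^\top P x = 0
\;\Longrightarrow\;
u^* = -R^{-1} B^\top P x = -K x, \qquad K = R^{-1} B^\top P

% Substitute u^* back, using 2 x^\top P A x = x^\top (A^\top P + P A) x
0 = x^\top \Big( Q + A^\top P + P A - P B R^{-1} B^\top P \Big) x
\quad \text{for all } x

% Hence the algebraic Riccati equation
A^\top P + P A - P B R^{-1} B^\top P + Q = 0
```

Linearity of u* in x is therefore a direct consequence of the quadratic ansatz: the gradient of a quadratic form is linear, and the minimization over u is an unconstrained quadratic problem.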

Pontryagin's Maximum Principle and BSDE

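The classical HJB equation requires a smooth V: the value function must be continuously differentiable in t and twice differentiable in x. For nonlinear systems this is not guaranteed, and one then has to work with weak (viscosity) solutions. Pontryagin's maximum principle is an alternative: optimality is described through adjoint variables (momenta) without a PDE.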

The BSDE for the adjoint process is not an abstraction. The backpropagation through time (BPTT) algorithm for RNNs is the discrete-time analog of Pontryagin's maximum principle. The gradient of the loss with respect to the hidden state is the discrete adjoint process, and the parameter gradients are assembled from it. Weinan E (2017) made this bridge explicit.
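The correspondence can be checked by hand on a scalar recurrence. The sketch below assumes a hypothetical one-parameter RNN h_{t+1} = tanh(w h_t + x_t) with loss 0.5 (h_T - y)^2; the inputs, target, and weight are made-up numbers. The adjoint p_t = dLoss/dh_t starts from a terminal condition and is propagated backward, exactly as in the continuous-time adjoint equation.

```python
import math


def forward(w, inputs, h0=0.5):
    """Run h_{t+1} = tanh(w * h_t + x_t) and return the whole trajectory."""
    hs = [h0]
    for x in inputs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs


def loss(w, inputs, target):
    """Terminal cost 0.5 * (h_T - target)**2."""
    return 0.5 * (forward(w, inputs)[-1] - target) ** 2


def bptt_gradient(w, inputs, target):
    """dLoss/dw via the backward adjoint recursion (BPTT)."""
    hs = forward(w, inputs)
    p = hs[-1] - target                  # terminal condition: p_T = dLoss/dh_T
    grad_w = 0.0
    for t in reversed(range(len(inputs))):
        local = 1.0 - hs[t + 1] ** 2     # tanh'(w * h_t + x_t)
        grad_w += p * local * hs[t]      # direct effect of w at step t
        p = p * local * w                # adjoint step: p_t from p_{t+1}
    return grad_w


inputs = [0.3, -0.1, 0.7, 0.2]
target, w = 0.4, 0.8

grad = bptt_gradient(w, inputs, target)
eps = 1e-6
grad_fd = (loss(w + eps, inputs, target) - loss(w - eps, inputs, target)) / (2 * eps)
print(grad, grad_fd)
```

The backward loop and the central finite difference agree to numerical precision. Note that the adjoint p is the gradient with respect to the state; the parameter gradient is the sum of p times the local sensitivity at each step.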

In the deterministic case (sigma = 0), the stochastic maximum principle reduces to the classical Pontryagin condition from 1956. The BSDE degenerates to a standard adjoint ODE.

What is the BSDE in Pontryagin's maximum principle and why is it solved backward in time?

BSDE: dp_t = -H_x dt + q_t dW_t with p_T = g_x(X_T). The terminal condition means the solution is sought backward from T to 0. This is the stochastic analog of backpropagation of gradients.
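For reference, the full system behind this BSDE can be written out as follows. The sketch fixes one common sign convention for a minimization problem and assumes that the diffusion coefficient sigma does not depend on the control (otherwise a second-order adjoint is needed); the running cost L and the dynamics f, sigma are the same symbols as in the HJB equation.

```latex
% Controlled state and cost
dX_t = f(X_t,u_t)\,dt + \sigma(X_t)\,dW_t, \qquad
J(u) = \mathbb{E}\!\left[ \int_0^T L(X_t,u_t)\,dt + g(X_T) \right]

% Hamiltonian
H(x,u,p,q) = L(x,u) + f(x,u)^\top p + \operatorname{tr}\!\big(\sigma(x)^\top q\big)

% Adjoint BSDE: terminal condition at T, adapted solution pair (p, q)
dp_t = -H_x(X_t,u_t,p_t,q_t)\,dt + q_t\,dW_t, \qquad p_T = g_x(X_T)

% Optimality condition
u_t^* \in \arg\min_u H(X_t^*, u, p_t, q_t)

% Deterministic case \sigma = 0: q = 0 and the BSDE becomes an ODE
\dot p_t = -H_x(x_t,u_t,p_t), \qquad p_T = g_x(x_T)

% Link to HJB when V is smooth
p_t = \nabla_x V(t,X_t), \qquad q_t = \nabla_x^2 V(t,X_t)\,\sigma(X_t)
```

The second component q is what makes the equation solvable backward in time: it is chosen so that p_t stays adapted to the information available at time t even though its boundary condition is given at T.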

Connections to other topics

Stochastic control connects probability theory, optimization, and machine learning

Deep RL (PPO, DQN) — Related topic
BSDE and SPDE — Related topic
LQR and Riccati — Related topic
RLHF — Related topic

Summary

HJB equation: -partial_t V = min_u [L + f^T grad_x V + sigma sigma^T : nabla_x^2 V / 2], with terminal condition V(T,x) = g(x)
LQR: quadratic V*(x) = x^T P x, matrix P from the algebraic Riccati equation
Pontryagin: adjoint process (p, q) satisfies BSDE with terminal condition
BPTT = discrete Pontryagin principle - backpropagation as adjoint equations

Questions for reflection

Why is the HJB equation a PDE of dimension d+1 and how does this relate to the curse of dimensionality in RL?
What is the fundamental difference between LQR and model predictive control (MPC)?
How is BPTT for an RNN a special case of Pontryagin's maximum principle?

Related lessons

sp-24-levy-processes — Controlled SDEs driven by Levy noise
sp-20 — Ito's formula is used to derive the HJB equation
sp-26-spde — The Zakai equation is an SPDE arising from nonlinear filtering
