Stochastic Processes

Impulse Control and Continuous-State MDPs

Lesson objectives

  • Derive the HJB equation from Bellman's optimality principle
  • Solve the LQR problem through the algebraic Riccati equation
  • Apply Pontryagin's maximum principle and BSDE for nonlinear problems
  • Connect stochastic control to Deep RL algorithms

Prerequisites

  • Stochastic differential equations
  • Levy processes
  • Brownian motion and Ito's formula

SpaceX lands a rocket vertically. The controller is not a neural network. It is LQR plus the Riccati equation, computed offline. A 6x6 matrix multiplied by the state vector - that is the entire onboard controller.

  • SpaceX Falcon 9: LQR control for first-stage landing
  • DeepMind: HJB as the mathematical foundation of Deep Q-Network
  • RLHF: Bellman equation for fine-tuning language models
  • Tesla FSD: iLQR planner for nonlinear vehicle dynamics

Pontryagin, Bellman, and the Space Race

Richard Bellman formulated the principle of dynamic programming at RAND Corporation in the early 1950s; his book Dynamic Programming appeared in 1957. Lev Pontryagin and his team (Boltyansky, Gamkrelidze, Mishchenko) obtained the maximum principle in Moscow in 1956-57. The two approaches, HJB and Pontryagin, were developed in parallel at the height of the Cold War. Rudolf Kalman solved the LQR problem in 1960 and showed that its solution is given by a matrix Riccati equation (the equation itself goes back to Jacopo Riccati in the 18th century). The connection to deep learning was made explicit by Weinan E in 2017.

The Hamilton-Jacobi-Bellman Equation

This lesson extends classical stochastic control to impulse controls with discrete actions, optimal stopping via quasi-variational inequalities, and MDPs with continuous state spaces.

DeepMind, 2015. DQN reaches human-level performance on 29 of the 49 Atari games it was tested on. The mathematics behind this is the Bellman equation. Not a heuristic, not a trick - a rigorous theorem: the optimal policy is recovered from the value function, which solves a fixed-point equation in discrete time and a PDE in continuous time. Stochastic control is reinforcement learning with a proof.

HJB in RLHF

Reinforcement Learning from Human Feedback as stochastic control

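RLHF (Ouyang et al., 2022): fine-tuning a language model with human feedback. State: conversation context. Control: next token. Noise: randomness in sampling. Value function V(context): expected future reward from that context onward. The discrete-time Bellman equation is a direct analog of HJB. PPO is a policy-gradient method rather than a Bellman-residual minimizer: it fits a value function to estimated returns and uses it to compute the advantages that drive the policy update.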
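To make the discrete-time Bellman equation concrete, here is a minimal value-iteration sketch on a toy MDP. The two-state, two-action MDP, the function name `value_iteration`, and the discount factor are all hypothetical choices made for illustration; nothing here models an actual RLHF pipeline.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Fixed-point iteration on V(s) = max_a [ r(s,a) + gamma * sum_t P[a,s,t] V(t) ].

    P has shape (actions, states, states); r has shape (states, actions).
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Q[s, a] = immediate reward + discounted expected value of the next state
        Q = r + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = r + gamma * np.einsum("ast,t->sa", P, V)
    return V, Q.argmax(axis=1)

# Hypothetical toy MDP: action 0 stays in place, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
])
# Reward 1 only for staying in state 1.
r = np.array([[0.0, 0.0],
              [1.0, 0.0]])

V, policy = value_iteration(P, r, gamma=0.9)
print("V =", V, "policy =", policy)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and state 0 is one discounted step away, so V = [9, 10] with policy [switch, stay].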

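The HJB equation is a PDE in d+1 variables (d state dimensions plus time). A grid-based solver with N points per axis needs on the order of N^d values per time step, so beyond roughly d > 5 it becomes impractical: this is the curse of dimensionality. Numerical alternatives: the deep Ritz method or deep BSDE, which use neural networks as approximators of V*.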
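The scaling can be checked with a few lines of arithmetic. The sketch below assumes 100 grid points per state axis and 8-byte floats; both numbers are illustrative assumptions, and `grid_points` is a hypothetical helper.

```python
def grid_points(points_per_axis: int, state_dim: int) -> int:
    """Number of grid nodes needed to store V on one time slice."""
    return points_per_axis ** state_dim

BYTES_PER_VALUE = 8  # float64

for d in (1, 2, 3, 5, 10):
    n = grid_points(100, d)
    gigabytes = n * BYTES_PER_VALUE / 1e9
    print(f"d = {d:2d}: {n:.3e} nodes, {gigabytes:.3e} GB per time slice")
```

Each extra state dimension multiplies the storage by the number of points per axis, which is why mesh-free approximators become attractive in high dimension.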

Why does V(t,x) satisfy a PDE rather than an ODE?

V depends on two arguments: time t and the d-dimensional state x. Bellman's dynamic programming principle yields an equation involving both partial_t V and grad_x V, which makes it a PDE. Stochasticity adds the second-order diffusion term (1/2) sigma sigma^T : nabla^2 V through Ito's formula.
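The argument can be written out in four lines. This is a sketch under the standard finite-horizon setup, which is assumed here rather than stated in the lesson: dynamics with drift f and diffusion sigma, running cost L, terminal cost g, and a value function smooth enough for Ito's formula to apply.

```latex
% Controlled diffusion and value function (assumed standard setup)
dX_s = f(X_s, u_s)\,ds + \sigma(X_s, u_s)\,dW_s, \qquad
V(t,x) = \min_{u}\ \mathbb{E}\Big[\int_t^T L(X_s,u_s)\,ds + g(X_T)\ \Big|\ X_t = x\Big]

% Dynamic programming principle over a short step h
V(t,x) = \min_{u}\ \mathbb{E}\Big[\int_t^{t+h} L(X_s,u_s)\,ds + V(t+h, X_{t+h})\Big]

% Ito's formula for V(t+h, X_{t+h}); the dW term has zero mean
\mathbb{E}\big[V(t+h,X_{t+h})\big] = V(t,x)
  + \mathbb{E}\int_t^{t+h}\Big(\partial_t V + f^\top \nabla_x V
  + \tfrac12\,\sigma\sigma^\top : \nabla_x^2 V\Big)\,ds

% Substitute, subtract V(t,x), divide by h, and let h -> 0
-\partial_t V = \min_{u}\Big[L + f^\top \nabla_x V
  + \tfrac12\,\sigma\sigma^\top : \nabla_x^2 V\Big], \qquad V(T,x) = g(x)
```

With sigma = 0 the second-order term vanishes and the equation reduces to the deterministic Hamilton-Jacobi equation, which is still a PDE because both t and x remain as independent variables.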

The Linear-Quadratic Regulator and Riccati Equation

SpaceX landed the Falcon 9 first stage in 2015. The controller: LQR with state feedback. State: rocket position and velocity. Control: engine thrust. The Riccati equation is solved once offline. During landing: pure matrix multiplication.

The Riccati equation is solved once - then the control is a matrix-vector multiplication. This is fundamentally faster than deep RL, where a neural network performs a forward pass at every step. For safety-critical systems (rockets, autopilots), LQR is the standard.

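| Method        | Optimality              | Computation             | Applications                     |
|---------------|-------------------------|-------------------------|----------------------------------|
| LQR           | Exact (linear system)   | ARE solved once         | SpaceX Falcon 9, Tesla autopilot |
| MPC           | Approximate (horizon N) | QP at each step         | Industrial robots, HVAC          |
| Deep RL (PPO) | Empirical               | Neural forward pass     | Atari, robotics, LLM (RLHF)      |
| iLQR          | Locally optimal         | Iterative linearization | Nonlinear robotics (MuJoCo)      |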

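iLQR (iterative LQR) linearizes the nonlinear dynamics and quadratizes the cost around the current trajectory, solves the resulting time-varying LQR problem, and repeats until the trajectory converges. It is used in Google DeepMind's locomotion planners for MuJoCo-based environments.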

In LQR the optimal control u* = -K x is linear in the state x. Where does this linearity come from?

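Under linear dynamics dx/dt = A x + B u and quadratic cost x^T Q x + u^T R u, the value function is quadratic: V*(x) = x^T P x, so grad V* = 2 P x. Substituting into the HJB optimality condition: u* = arg min_u [u^T R u + (2Px)^T B u] = -R^{-1} B^T P x. Setting the gradient in u to zero gives 2 R u + 2 B^T P x = 0, hence the gain K = R^{-1} B^T P. The matrix P solves the algebraic Riccati equation A^T P + P A - P B R^{-1} B^T P + Q = 0.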
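The gain formula above can be checked numerically on a small example. The sketch below is an illustration under assumptions: the double-integrator matrices A, B and the weights Q, R are hypothetical, and the helper `solve_are_by_riccati_flow` finds P by integrating the Riccati differential equation to its stationary point with plain Euler steps. That is a deliberately simple stand-in for a dedicated ARE solver, chosen to keep the example NumPy-only.

```python
import numpy as np

def solve_are_by_riccati_flow(A, B, Q, R, dt=1e-2, tol=1e-12, max_steps=200_000):
    """Integrate dP/ds = A^T P + P A - P B R^{-1} B^T P + Q from P = 0
    until the right-hand side (the ARE residual) vanishes."""
    P = np.zeros_like(Q, dtype=float)
    R_inv = np.linalg.inv(R)
    for _ in range(max_steps):
        residual = A.T @ P + P @ A - P @ B @ R_inv @ B.T @ P + Q
        if np.max(np.abs(residual)) < tol:
            return P
        P = P + dt * residual
    raise RuntimeError("Riccati flow did not converge")

# Hypothetical plant: double integrator (position, velocity), force input.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # state cost x^T Q x
R = np.array([[1.0]])    # control cost u^T R u

P = solve_are_by_riccati_flow(A, B, Q, R)   # the "offline" step
K = np.linalg.solve(R, B.T @ P)             # K = R^{-1} B^T P

# The "online" step is a single matrix-vector product per control cycle.
x = np.array([1.0, 0.0])
u = -K @ x

print("K =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))
```

For this particular plant the Riccati equation can be solved by hand, giving K = [1, sqrt(3)], and the closed-loop matrix A - B K has eigenvalues with negative real part, so the feedback is stabilizing.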

Pontryagin's Maximum Principle and BSDE

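The classical HJB derivation requires a smooth V: the value function must be differentiable in t and twice differentiable in x. For nonlinear systems this is not guaranteed. Pontryagin's maximum principle is an alternative: optimality is described through adjoint variables (momenta) along a single trajectory, without solving a PDE over the whole state space.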

The BSDE for the adjoint process is not an abstraction. The backpropagation through time (BPTT) algorithm for RNNs is the discrete-time analog of Pontryagin's maximum principle. The gradient of the loss with respect to the hidden state is the discrete adjoint process, and the parameter gradients are assembled from it step by step. Weinan E (2017) made this bridge explicit.

In the deterministic case (sigma = 0), the stochastic maximum principle reduces to the classical Pontryagin condition from 1956. The BSDE degenerates to a standard adjoint ODE.

What is the BSDE in Pontryagin's maximum principle and why is it solved backward in time?

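BSDE: dp_t = -H_x dt + q_t dW_t with p_T = g_x(X_T), where H is the Hamiltonian and g the terminal cost. The only boundary condition is given at the terminal time, so the solution is sought backward from T to 0. The second unknown q_t is what keeps p_t adapted to the information available at time t despite the terminal condition. This is the stochastic analog of backpropagation of gradients.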
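In the deterministic case (sigma = 0, so q_t drops out) the backward equation can be discretized and compared directly with a gradient computation. The sketch below assumes a hypothetical scalar system x_{k+1} = x_k + h (sin(x_k) + u_k), running cost u^2 / 2, and terminal cost g(x) = x^2 / 2; the function names are illustrative. With Hamiltonian H = u^2 / 2 + p (sin(x) + u), the backward recursion p_k = p_{k+1} + h H_x yields the gradient dJ/du_k = h H_u, which is checked against finite differences.

```python
import numpy as np

def rollout(x0, u, h):
    """Forward pass: Euler steps of dx/dt = sin(x) + u."""
    x = np.empty(len(u) + 1)
    x[0] = x0
    for k in range(len(u)):
        x[k + 1] = x[k] + h * (np.sin(x[k]) + u[k])
    return x

def cost(x0, u, h):
    """J = sum_k h * u_k^2 / 2 + x_N^2 / 2."""
    x = rollout(x0, u, h)
    return h * 0.5 * np.sum(u ** 2) + 0.5 * x[-1] ** 2

def adjoint_gradient(x0, u, h):
    """Backward pass: discrete adjoint recursion, started at the terminal time."""
    x = rollout(x0, u, h)
    grad = np.empty(len(u))
    p = x[-1]                                # p_N = g_x(x_N)
    for k in reversed(range(len(u))):
        grad[k] = h * (u[k] + p)             # dJ/du_k = h * H_u, using p_{k+1}
        p = p + h * p * np.cos(x[k])         # p_k = p_{k+1} + h * H_x
    return grad

x0, h = 1.0, 0.05
u = np.linspace(-1.0, 1.0, 20)

grad = adjoint_gradient(x0, u, h)

# Independent check: central finite differences on the cost.
eps = 1e-6
fd = np.empty_like(u)
for k in range(len(u)):
    e = np.zeros_like(u)
    e[k] = eps
    fd[k] = (cost(x0, u + e, h) - cost(x0, u - e, h)) / (2 * eps)

print("max |adjoint - finite difference| =", np.max(np.abs(grad - fd)))
```

The backward loop is structurally the same computation that reverse-mode automatic differentiation performs on the unrolled forward pass: one sweep from the terminal condition back to the start, reusing the stored states.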

Connections to other topics

Stochastic control connects probability theory, optimization, and machine learning

  • Deep RL (PPO, DQN)
  • BSDE and SPDE
  • LQR and Riccati
  • RLHF

Summary

  • HJB equation: -partial_t V = min_u [L + f^T grad_x V + (1/2) sigma sigma^T : nabla^2 V], with terminal condition V(T,x) = g(x)
  • LQR: quadratic V*(x) = x^T P x, matrix P from the algebraic Riccati equation
  • Pontryagin: adjoint process (p, q) satisfies BSDE with terminal condition
  • BPTT = discrete Pontryagin principle - backpropagation as adjoint equations

Questions for reflection

  • Why is the HJB equation a PDE of dimension d+1 and how does this relate to the curse of dimensionality in RL?
  • What is the fundamental difference between LQR and model predictive control (MPC)?
  • How is BPTT for an RNN a special case of Pontryagin's maximum principle?

Related lessons

  • sp-24-levy-processes — Controlled SDEs driven by Levy noise
  • sp-20 — Ito's formula is used to derive the HJB equation
  • sp-26-spde — The Zakai equation is an SPDE arising from nonlinear filtering