Differential Equations

Optimal Control

Lesson Objectives

Formulate an optimal control problem and construct the Hamiltonian
Apply the Pontryagin maximum principle and derive bang-bang control
Solve the linear-quadratic problem via the Riccati equation (LQR)
Understand how the HJB equation connects to the value function in reinforcement learning

Prerequisites

ODE systems
Calculus of variations
Linear algebra

Finite Element Method

How does SpaceX find a minimum-fuel rocket trajectory in 60 seconds during a live landing attempt?

SpaceX Falcon 9: first-stage landing is a minimum-fuel optimal control problem of the kind the Pontryagin principle addresses; reportedly 232 of 234 landing attempts succeeded in 2023
Tesla Autopilot and ABB industrial robots are cited as real-time users of MPC
DeepMind controls tokamak plasma via RL, a direct analog of Bellman's optimality principle
ChatGPT is trained with RLHF (reinforcement learning from human feedback), which builds on the same sequential decision framework

Pontryagin, Bellman, and the Arms Race

Lev Pontryagin lost his sight at 14 in a stove explosion. By 1956, working in the Soviet Union, he and his students had formulated the maximum principle, a landmark of control theory. At about the same time and independently, Richard Bellman in the United States was developing dynamic programming, which leads to the HJB equation. Both worked under the pressure of the arms race: missile guidance and submarine control required mathematical theory. Today their results form the foundation of reinforcement learning, a central tool of modern AI.

Pontryagin Maximum Principle

Landing a Falcon 9 first stage is a minimum-fuel problem of the kind the Pontryagin principle addresses: within about 60 seconds the guidance algorithm must find a throttle trajectory that minimizes fuel use. Reportedly, 232 of 234 landing attempts succeeded in 2023. The maximum principle is a first-order necessary condition for optimality, expressed through an auxiliary costate variable p(t); it is the infinite-dimensional analog of setting a gradient to zero.
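In symbols, the problem and its necessary conditions can be sketched as follows. The sign convention is an assumption chosen so that the optimal control maximizes H; the terminal cost \varphi, the fixed horizon T, and the free final state are illustrative choices, and only the normal case is shown.

```latex
\begin{aligned}
&\text{minimize } J[u] = \varphi\bigl(x(T)\bigr) + \int_0^T L(x,u)\,dt
 \quad \text{subject to } \dot{x} = f(x,u),\; x(0)=x_0,\; u(t)\in U, \\
&H(x,p,u) = p^{\top} f(x,u) - L(x,u), \\
&\dot{x}^* = \frac{\partial H}{\partial p} = f(x^*,u^*), \qquad
 \dot{p} = -\frac{\partial H}{\partial x}(x^*,p,u^*), \qquad
 p(T) = -\nabla\varphi\bigl(x^*(T)\bigr), \\
&u^*(t) \in \arg\max_{u \in U} H\bigl(x^*(t),p(t),u\bigr)
 \quad \text{for almost every } t \in [0,T].
\end{aligned}
```

The costate p(t) acts as a time-varying Lagrange multiplier for the constraint \dot{x} = f(x,u): it measures how sensitive the optimal cost is to a small change in the state at time t.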

The Pontryagin principle (1956) generalizes the calculus of variations to problems with control constraints. The minimum-time problem, the special case with running cost L = 1 and a compact control set U, underpins time-optimal rocket guidance.
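A standard textbook illustration of the minimum-time case (not taken from this lesson) is the double integrator x'' = u with |u| <= 1. With L = 1 the Hamiltonian p1*v + p2*u - 1 is linear in u, so the maximizing control is u = sign(p2); since p2 is linear in t, there is at most one switch. The sketch below assumes a start at rest at position x0 and checks that the resulting bang-bang control reaches the origin. The function names are hypothetical.

```python
import math

def bang_bang_from_rest(x0):
    """Time-optimal plan for x'' = u, |u| <= 1, from (x0, 0) to (0, 0).

    Returns (switch time, final time, control used before the switch).
    """
    t_switch = math.sqrt(abs(x0))
    first = -math.copysign(1.0, x0)   # accelerate toward the origin first
    return t_switch, 2.0 * t_switch, first

def simulate(x0, steps=20000):
    """Integrate the dynamics under the bang-bang plan; returns final (x, v)."""
    t_switch, t_final, first = bang_bang_from_rest(x0)
    dt = t_final / steps
    x, v = x0, 0.0
    for k in range(steps):
        u = first if k * dt < t_switch else -first   # single switch
        x += v * dt + 0.5 * u * dt * dt              # exact for constant u
        v += u * dt
    return x, v

if __name__ == "__main__":
    print(bang_bang_from_rest(1.0))
    print(simulate(1.0))
```

The control only ever takes the extreme values -1 and +1, which is what "bang-bang" refers to. For a general initial state the switch is determined by a switching curve in the (x, v) plane rather than by a closed-form time.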

What is the costate variable p(t) in the Pontryagin principle?

Hamilton-Jacobi-Bellman Equation

Bellman's optimality principle (1957) offers a different perspective: the value function V(x,t), the minimum future cost from state x at time t, satisfies a nonlinear PDE. This is the Hamilton-Jacobi-Bellman (HJB) equation. For the linear-quadratic problem it reduces to the Riccati equation (a matrix ODE on a finite horizon, an algebraic equation in the infinite-horizon limit), the cornerstone of modern control theory.
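To make the reduction concrete, consider a scalar example (an illustrative assumption, not from the lesson): dynamics x' = a*x + b*u and running cost q*x^2 + r*u^2. Substituting V(x,t) = P(t)*x^2 into the HJB equation -dV/dt = min_u [L + (dV/dx)*f] gives the Riccati ODE -dP/dt = q + 2aP - (b^2/r)P^2 and the feedback law u = -(b*P/r)*x. The sketch below integrates that ODE backward with a plain Euler scheme and compares the result with the root of the algebraic equation.

```python
import math

def riccati_backward(a, b, q, r, horizon=10.0, dt=1e-3):
    """Integrate -dP/dt = q + 2aP - (b^2/r) P^2 backward from P(T) = 0."""
    P = 0.0
    for _ in range(int(horizon / dt)):
        # Euler step in reverse time tau = T - t
        P += dt * (q + 2.0 * a * P - (b * b / r) * P * P)
    return P

def riccati_algebraic(a, b, q, r):
    """Positive root of q + 2aP - (b^2/r) P^2 = 0 (infinite-horizon limit)."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)

a, b, q, r = 1.0, 1.0, 1.0, 1.0          # unstable open loop (a > 0)
P_T = riccati_backward(a, b, q, r)       # P at time 0 for a long horizon
P_inf = riccati_algebraic(a, b, q, r)    # steady-state solution
K = b * P_inf / r                        # LQR gain, u = -K x

if __name__ == "__main__":
    print("P(0) from backward integration:", P_T)
    print("algebraic Riccati solution:    ", P_inf)
    print("closed-loop pole a - b*K:      ", a - b * K)
```

For a long horizon the backward solution settles onto the algebraic root, and the closed-loop pole a - b*K is negative, so the feedback stabilizes a system that is unstable on its own. In the matrix case the same structure holds with P a symmetric matrix.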

Curse of dimensionality: for a state in R^n the HJB equation is a PDE in n+1 independent variables (n state components plus time). A grid-based solver with N points per axis needs O(N^n) nodes, which is exponential in the state dimension. Remedies: deep learning for value function approximation (DeepBSDE, Han et al. 2018) or variational methods.
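A quick count shows how fast the grid grows. The numbers below assume 100 points per axis and one 8-byte float per node; both are illustrative choices.

```python
def grid_nodes(points_per_axis, state_dim):
    """Number of nodes in a full tensor-product grid."""
    return points_per_axis ** state_dim

if __name__ == "__main__":
    for n in (1, 2, 3, 6, 12):
        nodes = grid_nodes(100, n)
        gib = nodes * 8 / 2**30          # storage for one value per node
        print(f"n = {n:2d}: {nodes:.1e} nodes, {gib:.1e} GiB")
```

Already at n = 6 a single copy of the value function has 10^12 nodes, which is why grid methods are in practice limited to low-dimensional state spaces.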

Why does the HJB equation suffer from the curse of dimensionality?

Connection to Reinforcement Learning and Algorithms

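Reinforcement learning can be viewed as numerical optimal control for systems whose model is unknown or too complex to solve directly. The Q-function Q(s,a) is the state-action counterpart of the value function V from the HJB equation, with V(s) = max_a Q(s,a). Policy gradient methods estimate, from sampled trajectories, the gradient of expected return with respect to the parameters of the control policy. AlphaGo, ChatGPT (RLHF), and Tesla Autopilot are all cited as systems built on this mathematical framework.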

Optimal Control	Reinforcement Learning	Meaning
State x(t)	State s_t	Current system configuration
Control u(t)	Action a_t	Algorithm decision
Running cost L(x,u)	Reward r_t = -L	Step quality signal
Value function V(x,t)	Value function V(s)	Optimal future cost (control) or reward (RL)
HJB equation	Bellman equation	Recursive optimality
Pontryagin principle	Policy gradient	First-order optimality condition
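The correspondence in the table can be seen on a toy problem. The sketch below is an assumed example, not part of the lesson: a five-state chain with an absorbing goal at state 0 and running cost L = 1 per move, so the reward is r = -L = -1. Value iteration applies the Bellman equation Q(s,a) = r + V(s') and V(s) = max_a Q(s,a) until it reaches a fixed point.

```python
N_STATES = 5
ACTIONS = (-1, +1)
GOAL = 0

def step(s, a):
    """Deterministic dynamics on the chain, clipped at both ends."""
    if s == GOAL:
        return s, 0.0                      # absorbing goal, no further cost
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, -1.0                    # reward r = -L with L = 1

def value_iteration(sweeps=50):
    """Undiscounted value iteration; returns (V, Q)."""
    V = [0.0] * N_STATES
    Q = {}
    for _ in range(sweeps):
        Q = {}
        for s in range(N_STATES):
            for a in ACTIONS:
                s_next, r = step(s, a)
                Q[(s, a)] = r + V[s_next]  # Bellman backup
        V = [max(Q[(s, a)] for a in ACTIONS) for s in range(N_STATES)]
    return V, Q

if __name__ == "__main__":
    V, Q = value_iteration()
    print("V:", V)
    print("greedy action per state:",
          [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N_STATES)])
```

Here V(s) equals minus the number of steps to the goal, which is the discrete counterpart of the minimum-time value function, and the greedy action with respect to Q always moves toward the goal.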

Model-Predictive Control (MPC) is the online variant: at each time step an optimal control problem over a finite horizon T is solved from the current state, only the first control of the resulting plan is applied, and the horizon shifts forward by one step. Tesla Autopilot, ABB industrial robots, and aircraft autopilots are commonly cited applications of MPC.
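The receding-horizon loop can be sketched on a scalar discrete-time system x[k+1] = a*x[k] + b*u[k] with stage cost q*x^2 + r*u^2. This is a simplified, unconstrained example with assumed parameter values: the inner problem is solved by a backward Riccati recursion, whereas practical MPC usually adds state and input constraints and calls a numerical optimizer at each step.

```python
def first_control(x, a, b, q, r, horizon):
    """Solve the finite-horizon LQ problem from state x by backward
    Riccati recursion and return only the first control of the plan."""
    P = q                                    # terminal weight
    K = 0.0
    for _ in range(horizon):
        K = a * b * P / (r + b * b * P)      # gain for the current stage
        P = q + a * a * P - a * b * P * K    # Riccati update, one stage back
    return -K * x                            # K is now the stage-0 gain

def run_mpc(x0, a=1.2, b=1.0, q=1.0, r=1.0, horizon=10, steps=30):
    """Receding-horizon loop: re-plan, apply the first control, shift."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        u = first_control(x, a, b, q, r, horizon)  # plan from current state
        x = a * x + b * u                           # apply first control only
        trajectory.append(x)
    return trajectory

if __name__ == "__main__":
    traj = run_mpc(5.0)
    print([round(x, 4) for x in traj[:6]])
```

Re-planning from the measured state at every step is what gives MPC its feedback character: a disturbance that pushes x off the predicted path is taken into account at the next solve.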

DeepMind and Plasma Control in a Tokamak

DeepMind and Swiss Plasma Center (2022)

DeepMind and the Swiss Plasma Center trained an RL agent to control the shape of plasma in the TCV tokamak. The control task: 19 electromagnets, 92 control parameters, plasma at 100 million degrees. Traditional MPC requires 600+ hours of engineering effort for each new plasma regime; the RL agent adapts in hours. The work was published in Nature in 2022.

How does MPC solve the optimal control problem in real time?

Connections to Other Areas

Optimal control bridges differential equations, the calculus of variations, and modern machine learning.

Reinforcement Learning
Model-Predictive Control
Riccati Equation
Calculus of Variations

Summary

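Maximum principle: the optimal control maximizes H(x,p,u) over U at every instant; the costate p satisfies the adjoint ODE dp/dt = -∂H/∂x, which is integrated backward from a terminal condition
Bang-bang control is optimal when the Hamiltonian is linear in u and U is a bounded interval, except on singular arcs where the switching function vanishes over an interval of time
HJB is a nonlinear PDE for the value function V; for LQR it reduces to the Riccati equation (differential on a finite horizon, algebraic on an infinite horizon)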
RL, MPC, and policy gradient are numerical implementations of the same optimality principles

Questions for Reflection

Why is bang-bang control optimal for problems with a Hamiltonian that is linear in u?
How does the curse of dimensionality limit direct HJB solutions, and what does deep learning do about it?
What is the mathematical difference between the Pontryagin principle (a necessary condition) and the HJB equation (necessary and, for a sufficiently smooth value function, also sufficient)?

Related Lessons

de-25-fem: FEM solves the HJB PDE for distributed-parameter control problems
de-29-einstein-equations: geodesics in GR are variational problems with stationary action, a special case of optimal control
de-23-pde-bvp: HJB is a nonlinear PDE that requires terminal and boundary conditions
