Differential Equations
Optimal Control
Lesson objectives
- Formulate an optimal control problem and construct the Hamiltonian
- Apply the Pontryagin maximum principle and derive bang-bang control
- Solve the linear-quadratic problem via the Riccati equation (LQR)
- Understand how the HJB equation connects to the value function in reinforcement learning
Prerequisites
- ODE systems
- Calculus of variations
- Linear algebra
How does SpaceX find a minimum-fuel rocket trajectory in 60 seconds during a live landing attempt?
- SpaceX Falcon 9: Pontryagin principle for first-stage landing, 232/234 successes in 2023
- Tesla Autopilot and ABB industrial robots use MPC in real time
- DeepMind controls tokamak plasma via RL - a direct analog of Bellman's optimality principle
- ChatGPT is trained with RLHF - reinforcement learning from human feedback, mathematically the same problem
Pontryagin, Bellman, and the Arms Race
Lev Pontryagin lost his sight at 14 in a stove explosion. By 1956, working in the Soviet Union, he had proved the maximum principle - a landmark of control theory. Independently and at about the same time, Richard Bellman in the United States was developing dynamic programming and the HJB equation. Both worked under the pressure of the arms race: missile guidance and submarine control required mathematical theory. Today their results underpin reinforcement learning, a central tool of modern AI.
Pontryagin Maximum Principle
SpaceX uses the Pontryagin principle to land Falcon 9 first stages: within 60 seconds the algorithm finds a minimum-fuel throttle trajectory. In 2023, 232 out of 234 landing attempts succeeded. The maximum principle is a first-order necessary condition for optimality expressed through an auxiliary costate variable - the infinite-dimensional analog of setting a gradient to zero.
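To make the "auxiliary costate variable" concrete, the conditions of the maximum principle can be written out for one common setting. This is a sketch under stated assumptions that the lesson itself does not fix: a fixed final time T, a free final state, no terminal cost, and the normal case (the multiplier on the running cost is 1). The sign convention is chosen so that the optimal control maximizes H.

```latex
\begin{aligned}
&\text{Problem:} && \min_{u(\cdot)} \int_0^T L(x,u)\,dt, \qquad \dot x = f(x,u),\quad x(0)=x_0,\quad u(t)\in U \\
&\text{Hamiltonian:} && H(x,p,u) = p^{\top} f(x,u) - L(x,u) \\
&\text{State equation:} && \dot x = \frac{\partial H}{\partial p} = f(x,u) \\
&\text{Costate equation:} && \dot p = -\frac{\partial H}{\partial x}, \qquad p(T) = 0 \\
&\text{Maximum condition:} && u^*(t) = \arg\max_{u\in U} H\bigl(x^*(t),\,p(t),\,u\bigr)
\end{aligned}
```

The state is fixed at t = 0 and the costate at t = T, so the conditions form a two-point boundary value problem. When U is the whole space and H is smooth in u, the maximum condition reduces to the familiar stationarity condition ∂H/∂u = 0.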
The Pontryagin principle (1956) generalizes the calculus of variations to problems with control constraints. The minimum-time problem - the special case with running cost L = 1 and a compact control set U - underpins time-optimal rocket guidance.
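The minimum-time case can be illustrated on the textbook double integrator x1' = x2, x2' = u with |u| <= 1 (this system and all names below are assumptions chosen for illustration, not taken from the lesson). With L = 1 the Hamiltonian is H = p1*x2 + p2*u - 1, which is linear in u, so the maximizing control is u = sign(p2). Since p2 is linear in time, it changes sign at most once, which gives bang-bang control with a single switch. A minimal sketch of the resulting feedback law:

```python
import math

def time_optimal_control(x1, x2):
    """Bang-bang feedback for x1' = x2, x2' = u, |u| <= 1 (target: origin)."""
    # s = 0 is the switching curve x1 = -x2*|x2|/2 made of the two
    # full-thrust trajectories that pass through the origin.
    s = x1 + 0.5 * x2 * abs(x2)
    if s > 0:
        return -1.0
    if s < 0:
        return 1.0
    # Exactly on the curve: brake toward the origin.
    return -math.copysign(1.0, x2) if x2 != 0 else 0.0

def simulate(x1, x2, dt=1e-3, tol=1e-2, t_max=10.0):
    """Integrate until the state is within tol of the origin; return (t, x1, x2)."""
    t = 0.0
    while math.hypot(x1, x2) > tol and t < t_max:
        u = time_optimal_control(x1, x2)
        # Exact update for a control held constant over one step.
        x1 += x2 * dt + 0.5 * u * dt * dt
        x2 += u * dt
        t += dt
    return t, x1, x2

t_final, x1_final, x2_final = simulate(1.0, 0.0)
print(f"reached ({x1_final:.4f}, {x2_final:.4f}) at t = {t_final:.3f}")
```

Starting from rest at x1 = 1, the analytic minimum time is 2 (full braking thrust for one time unit, then full opposite thrust for one more), so the simulated arrival time should land close to that value, slightly early because of the stopping tolerance.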
What is the costate variable p(t) in the Pontryagin principle?
Hamilton-Jacobi-Bellman Equation
Bellman's optimality principle (1957) offers a different perspective: the value function V(x,t) - the minimum future cost from state x at time t - satisfies a nonlinear PDE. This is the HJB equation. For the linear-quadratic problem it reduces to the Riccati equation, the cornerstone of modern control theory.
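The reduction to the Riccati equation can be checked by hand in the scalar case. Assume (for illustration only) dynamics x' = a*x + b*u and cost equal to the integral of q*x^2 + r*u^2. Substituting V(x,t) = P(t)*x^2 into the HJB equation -V_t = min_u [q*x^2 + r*u^2 + V_x*(a*x + b*u)] gives u = -(b*P/r)*x and the Riccati ODE -P' = 2*a*P - (b^2/r)*P^2 + q with P(T) = 0. The sketch below integrates that ODE backward with explicit Euler and compares the result with the closed-form root of the algebraic equation; the function names and parameter values are hypothetical.

```python
import math

def riccati_rhs(P, a, b, q, r):
    """Right-hand side of dP/dtau, where tau = T - t is time-to-go."""
    return 2.0 * a * P - (b * b / r) * P * P + q

def solve_riccati_backward(a, b, q, r, horizon=10.0, dt=1e-3):
    """Explicit Euler from the terminal condition P(T) = 0 back to t = 0."""
    P = 0.0
    for _ in range(int(round(horizon / dt))):
        P += dt * riccati_rhs(P, a, b, q, r)
    return P

def algebraic_riccati(a, b, q, r):
    """Positive root of 2*a*P - (b^2/r)*P^2 + q = 0 (infinite-horizon limit)."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)

a, b, q, r = 1.0, 1.0, 1.0, 1.0
P0 = solve_riccati_backward(a, b, q, r)
K = b * P0 / r  # feedback gain in u = -K * x
print(f"P(0) = {P0:.6f}, algebraic root = {algebraic_riccati(a, b, q, r):.6f}")
print(f"closed-loop pole a - b*K = {a - b * K:.6f}")
```

With a long enough horizon, P(0) settles at the algebraic root, and the closed-loop coefficient a - b*K becomes negative even though the open-loop system (a = 1) is unstable.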
Curse of dimensionality: for a state in R^n, the HJB equation is a PDE in n+1 independent variables (n state coordinates plus time). A grid-based solver with N points per axis needs O(N^n) nodes - exponential in the state dimension. Remedies: neural-network approximation of the value function (DeepBSDE, Han et al. 2018) or variational methods.
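The O(N^n) growth is easy to underestimate, so here is the arithmetic for one assumed setup: 100 grid points per axis and one 8-byte float stored per node. Both numbers are illustrative choices, not figures from the lesson.

```python
def grid_nodes(points_per_axis, state_dim):
    """Number of nodes in a full tensor-product grid."""
    return points_per_axis ** state_dim

def memory_gib(points_per_axis, state_dim, bytes_per_value=8):
    """Memory needed to store one value per node, in GiB."""
    return grid_nodes(points_per_axis, state_dim) * bytes_per_value / 2**30

for n in (1, 2, 3, 6, 12):
    print(f"n = {n:2d}: {grid_nodes(100, n):.3e} nodes, {memory_gib(100, n):.3e} GiB")
```

A 3-dimensional state fits in a few megabytes, a 6-dimensional one already needs terabytes, and a 12-dimensional one (a rigid body with position, attitude, and their rates) is far beyond any storage, which is why grid-based HJB solvers are limited to low-dimensional problems.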
Why does the HJB equation suffer from the curse of dimensionality?
Connection to Reinforcement Learning and Algorithms
Reinforcement learning is numerical optimal control. The state-value function V(s) in RL is the discrete-time analog of the value function V(x,t) from the HJB equation, and the Q-function Q(s,a) is the same quantity before the optimization over the action. Policy gradient methods estimate the gradient of the expected return with respect to the policy parameters. AlphaGo, ChatGPT (RLHF), and Tesla Autopilot can all be described within this mathematical framework.
| Optimal Control | Reinforcement Learning | Meaning |
|---|---|---|
| State x(t) | State s_t | Current system configuration |
| Control u(t) | Action a_t | Algorithm decision |
| Running cost L(x,u) | Reward r_t = -L | Step quality signal |
| Value function V(x,t) | Value function V(s) | Optimal cost-to-go / optimal future reward |
| HJB equation | Bellman equation | Recursive optimality |
| Pontryagin principle | Policy gradient | First-order optimality condition |
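The "HJB equation / Bellman equation" row of the table can be illustrated with value iteration on a toy problem. The setup below is entirely hypothetical: states 0..5 on a line, actions "step left" and "step right", a cost of 1 per step, and state 0 as an absorbing goal. The Bellman equation is then V(s) = min over a of [1 + V(s')], the discrete counterpart of the minimization inside the HJB equation.

```python
def value_iteration(n_states=6, step_cost=1.0, sweeps=50):
    """Repeated Bellman backups on a 1-D chain with goal state 0."""
    goal = 0
    V = [0.0] * n_states
    for _ in range(sweeps):
        new_V = V[:]
        for s in range(n_states):
            if s == goal:
                continue  # the goal is absorbing and costs nothing
            candidates = []
            for action in (-1, +1):
                # Moves that would leave the chain keep the state at the wall.
                s_next = min(max(s + action, 0), n_states - 1)
                candidates.append(step_cost + V[s_next])
            new_V[s] = min(candidates)
        V = new_V
    return V

def greedy_action(V, s, step_cost=1.0):
    """Action minimizing the one-step Bellman backup at state s."""
    n = len(V)
    return min((-1, +1), key=lambda a: step_cost + V[min(max(s + a, 0), n - 1)])

V = value_iteration()
print(V)
print([greedy_action(V, s) for s in range(1, len(V))])
```

The converged values equal the number of steps to the goal, and the greedy policy read off from V always moves toward state 0. The same two-stage pattern (compute V, then act greedily with respect to it) is what the HJB equation provides in continuous time.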
Model-Predictive Control (MPC) is the online variant: at each time step an optimal control problem over a horizon T is solved, the first step is applied, and the horizon shifts forward. Tesla Autopilot, ABB industrial robots, and aircraft autopilots all run on MPC.
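The receding-horizon loop described above can be sketched for a scalar linear system x[k+1] = a*x[k] + b*u[k] with quadratic stage cost q*x^2 + r*u^2. All names and parameter values here are assumptions for illustration. To stay dependency-free, the inner horizon problem is left unconstrained and solved exactly by a backward Riccati recursion; practical MPC usually adds input and state constraints and hands the inner problem to a numerical optimizer.

```python
def first_step_gain(a, b, q, r, horizon):
    """Solve the finite-horizon LQ problem backward; return the gain for step 0."""
    P = q  # terminal weight on x[horizon]^2 (an assumed choice)
    K = 0.0
    for _ in range(horizon):
        K = a * b * P / (r + b * b * P)   # gain for the current stage
        P = q + a * a * P - a * b * P * K  # cost-to-go one stage earlier
    return K

def run_mpc(x0, a=1.2, b=1.0, q=1.0, r=1.0, horizon=10, steps=20):
    """Receding-horizon loop: re-solve, apply the first control, shift."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        K = first_step_gain(a, b, q, r, horizon)  # plan over the next `horizon` steps
        u = -K * x                                # keep only the first planned control
        x = a * x + b * u                         # the plant moves one step
        trajectory.append(x)
    return trajectory

trajectory = run_mpc(5.0)
print([round(x, 4) for x in trajectory[:6]])
```

Because this model is time-invariant and unconstrained, every re-solve returns the same gain, so the loop is equivalent to a fixed linear feedback. The re-planning step starts to matter once constraints, disturbances, or a changing reference enter the inner problem.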
DeepMind and Plasma Control in a Tokamak
DeepMind and Swiss Plasma Center (2022)
DeepMind and the Swiss Plasma Center trained an RL agent to control the shape of plasma in the TCV tokamak. The control task: 19 magnetic control coils driven from 92 sensor measurements, with plasma at 100 million degrees. Traditional controller design requires 600+ hours of engineering effort for each new plasma regime; the RL agent adapts in hours. The work was published in Nature in 2022.
How does MPC solve the optimal control problem in real time?
Connections to Other Areas
Optimal control bridges differential equations, the calculus of variations, and modern machine learning.
- Reinforcement Learning
- Model-Predictive Control
- Riccati Equation
- Calculus of Variations
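Summary
- Maximum principle: the optimal control maximizes H(x,p,u) at every instant; the costate p satisfies the adjoint ODE p' = -dH/dx, integrated backward from a terminal condition
- Bang-bang control is optimal when the Hamiltonian is linear in u and U is a bounded interval, except on singular arcs where the coefficient of u vanishes over an interval
- HJB is a nonlinear PDE for the value function V; for LQR it reduces to the Riccati equation (a differential equation on a finite horizon, an algebraic one on an infinite horizon)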
- RL, MPC, and policy gradient are numerical implementations of the same optimality principles
Questions for reflection
- Why is bang-bang control optimal for problems with a Hamiltonian that is linear in u?
- How does the curse of dimensionality limit direct HJB solutions, and what does deep learning do about it?
- What is the mathematical difference between the Pontryagin principle (a necessary condition) and the HJB equation (necessary and sufficient when the value function is smooth enough)?
Related lessons
- de-25-fem — FEM solves the HJB PDE for distributed-parameter control problems
- de-29-einstein-equations — Geodesics in GR are optimal control problems with minimum action
- de-23-pde-bvp — HJB is a nonlinear PDE requiring boundary conditions