Reinforcement Learning

Dynamic Programming in RL

1957. Bellman at RAND Corporation solves how to optimally control missiles. No AI, no neural networks - pure optimization mathematics. Policy Iteration and Value Iteration, developed then for missiles, underlie AlphaGo, ChatGPT RLHF, and Tesla Autopilot today. Understanding Dynamic Programming means understanding the mathematics modern RL is built on.

**AlphaGo**: Policy Iteration via MCTS + neural evaluation. Known environment model (Go rules) makes model-based planning possible
**GPS navigation**: Value Iteration on a road graph. Shortest path is a special case of optimal policy in a deterministic MDP
**Power grid management**: Policy Iteration for load distribution optimization with a known network model

Historical context

In 1957 Richard Bellman published 'Dynamic Programming' at RAND Corporation. The Bellman optimality principle states: an optimal policy has the property that whatever the initial state and initial decision, the remaining decisions must constitute an optimal policy with respect to the state resulting from the first decision. This same year Bellman formalized value iteration and the Bellman equation. Bellman also coined the term 'curse of dimensionality' to describe the exponential growth in complexity with the number of variables. In 1960 Ronald Howard introduced policy iteration in 'Dynamic Programming and Markov Processes', the second pillar of DP for RL. These DP limitations remain relevant today.

Предварительные знания

Bellman equation and the value function V(s), Q(s,a)
Markov decision process: states, actions, transitions P, reward R, gamma
Iterative algorithms and convergence intuition

Policy Iteration: evaluate and improve

**1957. RAND Corporation.** Richard Bellman is solving optimal missile control for the US Air Force. Key observation: an optimal policy can be found by alternating two steps - evaluating the current policy and improving it. This is Policy Iteration - the first exact algorithm for MDPs, guaranteed to converge to the optimum.

**Policy Iteration - two alternating steps:** 1. **Policy Evaluation** - compute V^pi(s) for the current policy pi. Solve the system of linear equations iteratively: until |V_new - V_old| > theta. 2. **Policy Improvement** - improve the policy: for each state, choose the action that maximizes Q^pi(s,a). The algorithm converges in a finite number of iterations (the number of possible policies = |A|^|S| is finite). Typically 5-20 iterations even for large MDPs.

**Policy Evaluation is expensive.** Full computation of V^pi requires many sweeps through all states until convergence. Modified Policy Iteration runs only k evaluation steps (k=1 is Value Iteration). In practice, Modified Policy Iteration with k=3-10 is often faster than either extreme.

Policy Iteration is guaranteed to converge to the optimal policy. In how many iterations?

Value Iteration: direct optimization

Policy Iteration alternates two steps. Value Iteration asks: why bother evaluating the policy separately at all? Just apply the Bellman optimality operator to the value function directly, without tracking an explicit policy. Mathematically it is one step of Policy Evaluation plus Policy Improvement, repeated until convergence.

Method	Iterations	Cost per iteration	Convergence
Policy Iteration	Few (~5-20)	Expensive (full eval)	Finite (exact)
Value Iteration	Many (hundreds)	Cheap (one sweep)	Asymptotic
Modified PI (k=5)	Moderate	Moderate	Finite

**When is Value Iteration better than Policy Iteration?** When the state space is large: Policy Evaluation solves a system of |S| equations - O(|S|^3) for an exact solution. Value Iteration does one sweep in O(|S| * |A|). At |S| = 10^6 Policy Evaluation is simply impossible; Value Iteration is the standard choice.

Value Iteration updates V(s) = max_a Q(s,a) on each iteration. Why does the algorithm converge when V* is unknown in advance?

Convergence: mathematical guarantees

Why do we know these algorithms work? The mathematics behind DP convergence in RL is Banach's fixed-point theorem applied to the Bellman operator. Understanding this theorem explains not just why DP works, but also why Q-Learning and TD methods work in the model-free context.

**Bellman operator T:** (TV)(s) = max_a SUM_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')] **Property 1: monotonicity.** If V(s) <= U(s) for all s, then TV(s) <= TU(s). **Property 2: contraction.** ||TV - TU||_inf <= gamma * ||V - U||_inf **Consequence:** T has a unique fixed point V*, and iterations V_{k+1} = TV_k converge to V* at rate gamma^k * ||V_0 - V*||.

**Stopping criterion theta:** Value Iteration stops when max_s |V_new(s) - V_old(s)| < theta. This guarantees the **current policy error** (not V!) is bounded by 2*gamma*theta/(1-gamma). With theta=1e-6 and gamma=0.9 - policy error <= 1.8e-5. Too small theta = wasted iterations. Too large = suboptimal policy.

Value Iteration with gamma=0.99 converges much slower than with gamma=0.9. Why does this matter when choosing gamma?

Model-based RL: when a model is available

**DP methods (Policy Iteration, Value Iteration) are model-based: they require knowing P(s'|s,a) and R(s,a,s').** In real tasks the model is rarely known in advance. But the model-based approach remains important for two reasons: 1) sometimes the model can be learned (Dyna-Q), 2) it provides the theoretical foundation for model-free methods.

Approach	Needs model?	Complexity	Application
Value Iteration	Yes	O(\|S\|^2 * \|A\|) per iteration	Planning in small MDPs
Policy Iteration	Yes	O(\|S\|^3) per eval	Exact solution when \|S\| < 10^4
Dyna-Q	Builds from experience	Moderate	Combines learning and planning
Q-Learning	No	O(1) per step	Large/unknown state spaces

**AlphaGo uses model-based RL.** The environment model (the game of Go) is fully known - the rules are deterministic. Policy Iteration (Monte Carlo Tree Search) plans ahead from each position using neural networks to evaluate positions (value network) and select moves (policy network). This combines model-based planning with model-free learning from self-play.

Dyna-Q runs n_planning simulated planning steps after each real step. Why?

Dynamic Programming in the RL ecosystem

DP is the exact, model-based core. Every model-free method that follows is an approximation of these same Bellman updates:

Bellman Equations — DP algorithms are the iterative procedures that solve the Bellman optimality equation
Policy Gradient — Contrast: policy gradient optimizes the policy directly without a model P, where DP needs one
TD Learning and Q-Learning — TD methods replace DP's full sweeps over a known model with sampled bootstrapped updates
Markov Decision Processes — DP assumes the full MDP (P, R) is known, which is exactly the model these algorithms exploit

Key ideas

**Policy Iteration** alternates Policy Evaluation (compute V^pi for current pi) and Policy Improvement (choose the best action by Q). Converges in finite iterations. Slow evaluation is the bottleneck
**Value Iteration** applies the Bellman operator directly to V without an explicit policy. One cheap sweep instead of expensive evaluation. Converges asymptotically - requires threshold theta
**Convergence speed** = gamma^k: high gamma -> slow convergence. Stopping criterion theta determines policy error via 2*gamma*theta/(1-gamma)
**Model-based RL** requires P and R but gives exact answers. Dyna-Q builds a model from experience and plans n_planning simulated steps - acceleration without extra real interactions

Вопросы для размышления

Policy Iteration converges in finite iterations, Value Iteration asymptotically. Why is Value Iteration often preferred in practice?
Dyna-Q builds an environment model from accumulated experience. What happens if the environment changes (concept drift)? How should the algorithm adapt?
AlphaGo knows the environment model (Go rules) and uses MCTS as Policy Iteration. Why is the model-based approach impractical for most real-world RL tasks (robot control, trading)?

Связанные уроки

rl-03 — MDP and Bellman equations - the mathematical foundation that DP solves iteratively
rl-05 — Monte Carlo methods - the model-free alternative to DP when the environment is unknown
aut-04-mdp — MDP from the automata section - same material with emphasis on DP algorithms
rl-06 — Temporal Difference learning combines ideas from DP and Monte Carlo
ml-49-q-learning — Q-Learning - practical implementation of DP ideas in the model-free context
alg-21-dp