Game Theory

Sequential decisions: the Bellman equation and MDPs

A single decision under uncertainty falls under expected utility. A long chain of dependent decisions explodes combinatorially. The Bellman recursion compresses that chain into a fixed-point computation and remains the algorithmic backbone of reinforcement learning sixty years later.

**AlphaGo Zero (2017):** value and policy networks trained by Monte Carlo tree search are direct descendants of value and policy iteration
**Inventory and supply-chain control:** Walmart and Amazon solve large-scale MDPs to set replenishment thresholds across millions of SKUs
**RLHF for language models:** the policy-gradient stage of ChatGPT optimizes a discounted return on a reward model, an MDP in disguise

Предварительные знания

Decision theory: expected utility

The Bellman equation and value iteration

Richard Bellman framed the dynamic programming principle in 1957 to plan multi-stage missile trajectories at RAND. A 20-step decision chain with 10 actions per step has 10^20 trajectories. Bellman's recursion replaced exhaustive search with a fixed-point iteration that converges in O(log(1/eps)/log(1/gamma)) steps and underlies every modern RL algorithm, including the value networks behind AlphaGo Zero (2017).

Why is value iteration guaranteed to converge to a unique fixed point V*?

Policy iteration and the link to reinforcement learning

Ronald Howard introduced policy iteration in 1960. On the Tetris benchmark of Bertsekas and Tsitsiklis (1996), 22 sweeps of policy iteration matched the score of 200 sweeps of value iteration. The same alternation of evaluation and improvement reappears in actor-critic algorithms and in the RLHF pipeline that fine-tuned ChatGPT in 2022.

What is the main theoretical advantage of policy iteration over value iteration?

Q-learning and modern reinforcement learning

Q-learning (Watkins 1989) is the model-free counterpart to value iteration: instead of knowing the transition kernel P(s'|s, a) the agent estimates Q(s, a) from sampled trajectories. DeepMind's 2015 DQN paper combined Q-learning with a convolutional network and achieved human-level play on 49 Atari games.

Industrial applications: AlphaGo (2016, Q-learning + tree search), self-driving (Wayve, Waymo), data-center cooling (DeepMind reduced Google data-center cooling energy by 40%), and chip placement (Google TPU v5 floorplan generated via RL).

What is the main advantage of Q-learning over value iteration?

Value iteration requires the full MDP model (P, R). Q-learning is model-free: it updates Q(s, a) from sampled transitions (s, a, r, s'), without any knowledge of P. This makes it applicable to environments where the model is unknown or too complex to enumerate - games, robotics, real-world control.

Key points

**MDP:** the tuple (S, A, P, r, gamma) formalizes sequential decisions with the Markov property
**Bellman optimality:** V* satisfies V* = max_a [r + gamma P V*]; the operator T* is a gamma-contraction
**Value iteration:** geometric convergence at rate gamma; O(log(1/eps)/log(1/gamma)) sweeps
**Policy iteration:** finite termination via monotone improvement; one matrix solve per sweep
**Bridge to RL:** Q-learning replaces the model with sampled transitions; actor-critic combines both updates

Вопросы для размышления

How does the discount factor gamma trade off planning horizon against convergence speed?
Why does the Bellman operator remain a contraction in the action-value formulation Q*?
Where do the assumptions of the contraction argument break down in deep RL with function approximation?

Связанные уроки

prob-17

The Bellman equation and value iteration

Why is value iteration guaranteed to converge to a unique fixed point V*?

Policy iteration and the link to reinforcement learning

What is the main theoretical advantage of policy iteration over value iteration?

Q-learning and modern reinforcement learning

What is the main advantage of Q-learning over value iteration?

Key points

**MDP:** the tuple (S, A, P, r, gamma) formalizes sequential decisions with the Markov property

**Bellman optimality:** V* satisfies V* = max_a [r + gamma P V*]; the operator T* is a gamma-contraction

**Value iteration:** geometric convergence at rate gamma; O(log(1/eps)/log(1/gamma)) sweeps

**Policy iteration:** finite termination via monotone improvement; one matrix solve per sweep

**Bridge to RL:** Q-learning replaces the model with sampled transitions; actor-critic combines both updates