Reinforcement Learning
Markov Decision Processes (MDP)
Sutton and Barto: Temporal-Difference Learning
In 1988, Richard Sutton and Andrew Barto published the foundational work on Temporal-Difference Learning - a learning algorithm that learns from the difference between predictions at successive states. Their MDP formalism became the standard for RL. Their textbook "Reinforcement Learning: An Introduction" (1998, 2nd ed. 2018) remains the primary reference for the field.
Цели урока
- Understand the Markov property and be able to verify it for real tasks
- Distinguish discrete and continuous state and action spaces
- Know the full MDP definition as a tuple (S, A, P, R, gamma)
- Understand the role of the discount factor and how to choose its value
2016. AlphaGo beat Lee Sedol 4:1. The task: 10^170 board states - more than atoms in the universe. The solution: MCTS plus policy gradient on the MDP formalism. The same self-play trains Boston Dynamics robots to walk. DeepMind in 2022 applied MDP to plasma control in a tokamak. One formalism - infinitely different tasks.
- **AlphaGo / AlphaZero (DeepMind)** - MDP with 10^170 board states, PPO plus MCTS solves via self-play
- **Boston Dynamics Spot** - locomotion as a continuous MDP: state = 20 joint angles plus velocities, actions = joint torques
- **DeepMind Tokamak (2022)** - plasma control in a fusion reactor as MDP: 90 state variables, 19 control coils
- **Netflix recommendations** - sequence of user choices as MDP: state = viewing history, action = recommend a film
- **OpenAI Dota 2** - OpenAI Five: multi-agent MDP with partial observability, 20,000 actions per timestep
Предварительные знания
State space and the Markov property
2016. AlphaGo. A board position with 10^170 possible continuations. AlphaGo does not remember all previous moves - only the current position. That is sufficient. This is the intuition behind the **Markov property**: the future depends only on the present, not on the history that led to it.
**Markov property:** P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0). The transition probability depends only on the current state and action - all history is already encoded in s_t.
**State space S** - the set of all possible situations the agent can be in. Discrete (finite number of board cells) or continuous (robot position and velocity). The choice determines the algorithm: lookup table vs neural network.
| Task | State | Markov? | Why |
|---|---|---|---|
| Chess | Position of all pieces | Yes | Position fully determines all possible moves |
| Poker | Cards + table | No | Discarded cards affect probabilities |
| CartPole | [x, v, theta, omega] | Yes | Position + velocity fully determine the physics |
| Autopilot (camera) | Current frame | No | A single frame does not show speed and direction |
If a task is not Markovian, it can often be fixed by expanding the state. Autopilot: a stack of the last 4 frames - speed and direction become computable. Poker: add the betting history to the state. DeepMind does exactly this for Atari - 4 frames instead of 1.
**Discrete vs continuous** - a fundamental distinction. For discrete states, Q-tables are feasible. For continuous states, approximators - neural networks - are required. This is where Deep RL originates: DQN, PPO, SAC.
Which task does NOT satisfy the Markov property under a naive state definition?
Action space
**Action space A** - the set of all actions available to the agent. Discrete actions: Atari - 18 joystick buttons. Continuous actions: a Boston Dynamics manipulator with 20 degrees of freedom, each with torque ranging from -100 to 100 Nm. The dimensionality of the action space determines which algorithm is applicable.
| Type | Example task | Actions | Dimensionality |
|---|---|---|---|
| Discrete | Atari Breakout | {up, down, left, right, fire} | |A| = 5 |
| Discrete | CartPole | {left, right} | |A| = 2 |
| Continuous | Robot arm | Rotation angles of 6 joints | A subset of R^6 |
| Mixed | StarCraft | Unit selection (discrete) + coordinates (continuous) | Combination |
An important technique is **action masking**: not all actions are valid in every state. In chess, a knight cannot move to a square occupied by a friendly piece. Masking is more efficient than penalizing invalid actions. AlphaGo Zero applies masks for all legal moves at every MCTS step.
**Action space dimensionality** critically affects training difficulty. CartPole with 2 actions learns in minutes. A robot with 20 continuous degrees of freedom is orders of magnitude harder. This is the **curse of dimensionality** in action space.
For continuous actions a Q-table is not feasible - algorithms like DDPG, SAC, or PPO are needed. For discrete actions, classical Q-learning and DQN work well. SAC is especially effective for robotics: it maximizes both reward and policy entropy.
Why is action masking preferable to penalizing invalid actions?
Transition function
**Transition function P(s'|s, a)** - the probability of reaching state s' when in state s and taking action a. This is the environment's "physics". A manipulator pushes an object: P(s'|s, a) is governed by Newton's laws. A financial market: P(s'|s, a) includes market noise, reactions of other participants, and macroeconomic events.
Environments can be **deterministic** (one action always leads to the same state) or **stochastic** (the result is random). The real world is almost always stochastic: a ball throw depends on the wind, a steering input depends on road conditions, a user response depends on their mood.
For small discrete environments the transition function can be written as a **transition matrix**. For each action a - a matrix of size |S| x |S|, where element [i][j] = P(s_j | s_i, a).
The full MDP specification is a tuple **(S, A, P, R, gamma)**: state space, action space, transition function, reward function, and discount factor. Given all 5 components, the problem can be solved mathematically via dynamic programming.
**Model-based vs Model-free RL.** If the agent knows P(s'|s,a) - that is model-based. If not - model-free. Most practical RL algorithms (Q-learning, PPO, SAC) are model-free: they learn from experience without knowing the transitions. The real world is too complex to model explicitly.
In the stochastic Frozen Lake environment, the agent chooses the action 'right', but with probability 1/3 slides down. What describes this situation?
Discount factor and planning horizon
**Discount factor gamma in [0, 1)** is not simply the agent's "impatience". It has a rigorous mathematical justification: it guarantees that the sum of rewards **converges** for infinite episodes. Without it, comparing two infinite reward sums is impossible - both diverge.
Gamma defines the agent's **effective planning horizon**. Rule of thumb: rewards beyond 1/(1-gamma) steps contribute negligibly.
| gamma | Effective horizon | Steps until reward is less than 1% | Application |
|---|---|---|---|
| 0.9 | ~10 steps | 44 steps | Short episodes, fast games |
| 0.95 | ~20 steps | 90 steps | Medium-length tasks |
| 0.99 | ~100 steps | 459 steps | Long-term planning |
| 0.999 | ~1000 steps | 6905 steps | Very long episodes |
**Formal MDP definition:** a tuple (S, A, P, R, gamma), where S is the state space, A is the action space, P: S x A x S → [0,1] is the transition function, R: S x A → R is the reward function, gamma in [0,1) is the discount factor. Every RL task - from a gridworld to controlling a tokamak - is an MDP with five components.
Gamma is a hyperparameter set manually. In practice, start with gamma=0.99; if training is unstable, reduce it. Too small a gamma: a short-sighted agent (AlphaGo with gamma=0.5 would not win games). Too large: slow convergence.
The full language for describing RL tasks is now in place. Any environment - from a gridworld to plasma control in a nuclear fusion reactor (DeepMind and Tokamak, 2022) - is an MDP with five components. The next lesson covers Bellman equations for solving MDPs.
MDP requires the agent to know the transition function P(s'|s,a) - without it RL is impossible
Most practical RL algorithms are model-free. They do not know P(s'|s,a) and learn directly from experience (by observing transitions s, a → s', r).
Model-free RL (Q-learning, SARSA, PPO) does not need knowledge of the environment's dynamics. The agent simply collects experience - tuples (s, a, r, s') - and updates its policy. This makes RL applicable to tasks where the environment model is unknown or too complex (the real world, complex simulations, biological experiments).
With gamma = 0.95, what is the approximate effective planning horizon of the agent?
Key ideas
- **Markov property** - the future depends only on the current state, not history. AlphaGo does not remember all moves - only the current position
- **State and action spaces** can be discrete (Q-table) or continuous (neural network). This determines the algorithm choice
- **Transition function P(s'|s,a)** describes the environment's physics. Stochasticity is the norm: the real world is slippery, like Frozen Lake
- **Discount factor gamma** - a mathematical necessity for finite reward sums. Planning horizon is approximately 1/(1-gamma)
- **MDP = (S, A, P, R, gamma)** - the complete language for any RL task: from a gridworld to a tokamak
- Sutton and Barto (1988) established TD-Learning and the MDP formalism, without which there would be no AlphaGo or ChatGPT RLHF
- Model-free RL does not require knowing P(s'|s,a) - most production algorithms (PPO, SAC, DQN) are model-free
Related topics
MDP is the foundation on which all RL algorithms are built:
- Introduction to RL — Intuitive concepts of agent, environment, and reward that MDP formalizes mathematically
- Bellman Equations — Equations for computing the value of states in an MDP - the bridge from model to solution
- Probability Theory — P(s'|s,a) is a conditional probability; discounted return is a mathematical expectation
Вопросы для размышления
- Model a typical working day as an MDP: what are the states, actions, transitions, and rewards? Is this task Markovian?
- Why can gamma = 1.0 not be used for infinite episodes, but can be used for finite ones (e.g., a chess game)?
- What real-world tasks are difficult to formalize as an MDP? Consider situations where the definition of state is ambiguous.
Связанные уроки
- rl-01 — Intuitive concepts of agent, environment, and reward
- rl-03 — Bellman equations - the next step after MDP formalization
- prob-17 — Markov chains are the math behind the Markov property
- prob-03-conditional — P(s'|s,a) is a conditional probability from probability theory
- alg-21-dp — Dynamic programming solves MDPs via Bellman equations
- prob-04-bayes — Bayesian inference is used in model-based RL