Reinforcement Learning

Markov Decision Processes (MDP)

Sutton and Barto: Temporal-Difference Learning

In 1988, Richard Sutton and Andrew Barto published the foundational work on Temporal-Difference Learning - a learning algorithm that learns from the difference between predictions at successive states. Their MDP formalism became the standard for RL. Their textbook "Reinforcement Learning: An Introduction" (1998, 2nd ed. 2018) remains the primary reference for the field.

Цели урока

Understand the Markov property and be able to verify it for real tasks
Distinguish discrete and continuous state and action spaces
Know the full MDP definition as a tuple (S, A, P, R, gamma)
Understand the role of the discount factor and how to choose its value

2016. AlphaGo beat Lee Sedol 4:1. The task: 10^170 board states - more than atoms in the universe. The solution: MCTS plus policy gradient on the MDP formalism. The same self-play trains Boston Dynamics robots to walk. DeepMind in 2022 applied MDP to plasma control in a tokamak. One formalism - infinitely different tasks.

**AlphaGo / AlphaZero (DeepMind)** - MDP with 10^170 board states, PPO plus MCTS solves via self-play
**Boston Dynamics Spot** - locomotion as a continuous MDP: state = 20 joint angles plus velocities, actions = joint torques
**DeepMind Tokamak (2022)** - plasma control in a fusion reactor as MDP: 90 state variables, 19 control coils
**Netflix recommendations** - sequence of user choices as MDP: state = viewing history, action = recommend a film
**OpenAI Dota 2** - OpenAI Five: multi-agent MDP with partial observability, 20,000 actions per timestep

Предварительные знания

Introduction to RL: Agent and Environment

State space and the Markov property

2016. AlphaGo. A board position with 10^170 possible continuations. AlphaGo does not remember all previous moves - only the current position. That is sufficient. This is the intuition behind the **Markov property**: the future depends only on the present, not on the history that led to it.

**Markov property:** P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0). The transition probability depends only on the current state and action - all history is already encoded in s_t.

**State space S** - the set of all possible situations the agent can be in. Discrete (finite number of board cells) or continuous (robot position and velocity). The choice determines the algorithm: lookup table vs neural network.

Task	State	Markov?	Why
Chess	Position of all pieces	Yes	Position fully determines all possible moves
Poker	Cards + table	No	Discarded cards affect probabilities
CartPole	[x, v, theta, omega]	Yes	Position + velocity fully determine the physics
Autopilot (camera)	Current frame	No	A single frame does not show speed and direction

If a task is not Markovian, it can often be fixed by expanding the state. Autopilot: a stack of the last 4 frames - speed and direction become computable. Poker: add the betting history to the state. DeepMind does exactly this for Atari - 4 frames instead of 1.

**Discrete vs continuous** - a fundamental distinction. For discrete states, Q-tables are feasible. For continuous states, approximators - neural networks - are required. This is where Deep RL originates: DQN, PPO, SAC.

Which task does NOT satisfy the Markov property under a naive state definition?

Action space

**Action space A** - the set of all actions available to the agent. Discrete actions: Atari - 18 joystick buttons. Continuous actions: a Boston Dynamics manipulator with 20 degrees of freedom, each with torque ranging from -100 to 100 Nm. The dimensionality of the action space determines which algorithm is applicable.

Type	Example task	Actions	Dimensionality
Discrete	Atari Breakout	{up, down, left, right, fire}	\|A\| = 5
Discrete	CartPole	{left, right}	\|A\| = 2
Continuous	Robot arm	Rotation angles of 6 joints	A subset of R^6
Mixed	StarCraft	Unit selection (discrete) + coordinates (continuous)	Combination

An important technique is **action masking**: not all actions are valid in every state. In chess, a knight cannot move to a square occupied by a friendly piece. Masking is more efficient than penalizing invalid actions. AlphaGo Zero applies masks for all legal moves at every MCTS step.

**Action space dimensionality** critically affects training difficulty. CartPole with 2 actions learns in minutes. A robot with 20 continuous degrees of freedom is orders of magnitude harder. This is the **curse of dimensionality** in action space.

For continuous actions a Q-table is not feasible - algorithms like DDPG, SAC, or PPO are needed. For discrete actions, classical Q-learning and DQN work well. SAC is especially effective for robotics: it maximizes both reward and policy entropy.

Why is action masking preferable to penalizing invalid actions?

Transition function

**Transition function P(s'|s, a)** - the probability of reaching state s' when in state s and taking action a. This is the environment's "physics". A manipulator pushes an object: P(s'|s, a) is governed by Newton's laws. A financial market: P(s'|s, a) includes market noise, reactions of other participants, and macroeconomic events.

Environments can be **deterministic** (one action always leads to the same state) or **stochastic** (the result is random). The real world is almost always stochastic: a ball throw depends on the wind, a steering input depends on road conditions, a user response depends on their mood.

For small discrete environments the transition function can be written as a **transition matrix**. For each action a - a matrix of size |S| x |S|, where element [i][j] = P(s_j | s_i, a).

The full MDP specification is a tuple **(S, A, P, R, gamma)**: state space, action space, transition function, reward function, and discount factor. Given all 5 components, the problem can be solved mathematically via dynamic programming.

**Model-based vs Model-free RL.** If the agent knows P(s'|s,a) - that is model-based. If not - model-free. Most practical RL algorithms (Q-learning, PPO, SAC) are model-free: they learn from experience without knowing the transitions. The real world is too complex to model explicitly.

In the stochastic Frozen Lake environment, the agent chooses the action 'right', but with probability 1/3 slides down. What describes this situation?

Discount factor and planning horizon

**Discount factor gamma in [0, 1)** is not simply the agent's "impatience". It has a rigorous mathematical justification: it guarantees that the sum of rewards **converges** for infinite episodes. Without it, comparing two infinite reward sums is impossible - both diverge.

Gamma defines the agent's **effective planning horizon**. Rule of thumb: rewards beyond 1/(1-gamma) steps contribute negligibly.

gamma	Effective horizon	Steps until reward is less than 1%	Application
0.9	~10 steps	44 steps	Short episodes, fast games
0.95	~20 steps	90 steps	Medium-length tasks
0.99	~100 steps	459 steps	Long-term planning
0.999	~1000 steps	6905 steps	Very long episodes

**Formal MDP definition:** a tuple (S, A, P, R, gamma), where S is the state space, A is the action space, P: S x A x S → [0,1] is the transition function, R: S x A → R is the reward function, gamma in [0,1) is the discount factor. Every RL task - from a gridworld to controlling a tokamak - is an MDP with five components.

Gamma is a hyperparameter set manually. In practice, start with gamma=0.99; if training is unstable, reduce it. Too small a gamma: a short-sighted agent (AlphaGo with gamma=0.5 would not win games). Too large: slow convergence.

The full language for describing RL tasks is now in place. Any environment - from a gridworld to plasma control in a nuclear fusion reactor (DeepMind and Tokamak, 2022) - is an MDP with five components. The next lesson covers Bellman equations for solving MDPs.

MDP requires the agent to know the transition function P(s'|s,a) - without it RL is impossible

Most practical RL algorithms are model-free. They do not know P(s'|s,a) and learn directly from experience (by observing transitions s, a → s', r).

Model-free RL (Q-learning, SARSA, PPO) does not need knowledge of the environment's dynamics. The agent simply collects experience - tuples (s, a, r, s') - and updates its policy. This makes RL applicable to tasks where the environment model is unknown or too complex (the real world, complex simulations, biological experiments).

With gamma = 0.95, what is the approximate effective planning horizon of the agent?

Key ideas

**Markov property** - the future depends only on the current state, not history. AlphaGo does not remember all moves - only the current position
**State and action spaces** can be discrete (Q-table) or continuous (neural network). This determines the algorithm choice
**Transition function P(s'|s,a)** describes the environment's physics. Stochasticity is the norm: the real world is slippery, like Frozen Lake
**Discount factor gamma** - a mathematical necessity for finite reward sums. Planning horizon is approximately 1/(1-gamma)
**MDP = (S, A, P, R, gamma)** - the complete language for any RL task: from a gridworld to a tokamak
Sutton and Barto (1988) established TD-Learning and the MDP formalism, without which there would be no AlphaGo or ChatGPT RLHF
Model-free RL does not require knowing P(s'|s,a) - most production algorithms (PPO, SAC, DQN) are model-free

Вопросы для размышления

Model a typical working day as an MDP: what are the states, actions, transitions, and rewards? Is this task Markovian?
Why can gamma = 1.0 not be used for infinite episodes, but can be used for finite ones (e.g., a chess game)?
What real-world tasks are difficult to formalize as an MDP? Consider situations where the definition of state is ambiguous.

Связанные уроки

rl-01 — Intuitive concepts of agent, environment, and reward
rl-03 — Bellman equations - the next step after MDP formalization
prob-17 — Markov chains are the math behind the Markov property
prob-03-conditional — P(s'|s,a) is a conditional probability from probability theory
alg-21-dp — Dynamic programming solves MDPs via Bellman equations
prob-04-bayes — Bayesian inference is used in model-based RL

Reinforcement Learning

Markov Decision Processes (MDP)

Sutton and Barto: Temporal-Difference Learning

Цели урока

Understand the Markov property and be able to verify it for real tasks
Distinguish discrete and continuous state and action spaces
Know the full MDP definition as a tuple (S, A, P, R, gamma)
Understand the role of the discount factor and how to choose its value

**AlphaGo / AlphaZero (DeepMind)** - MDP with 10^170 board states, PPO plus MCTS solves via self-play
**Boston Dynamics Spot** - locomotion as a continuous MDP: state = 20 joint angles plus velocities, actions = joint torques
**DeepMind Tokamak (2022)** - plasma control in a fusion reactor as MDP: 90 state variables, 19 control coils
**Netflix recommendations** - sequence of user choices as MDP: state = viewing history, action = recommend a film
**OpenAI Dota 2** - OpenAI Five: multi-agent MDP with partial observability, 20,000 actions per timestep

Предварительные знания

Introduction to RL: Agent and Environment

State space and the Markov property

Task	State	Markov?	Why
Chess	Position of all pieces	Yes	Position fully determines all possible moves
Poker	Cards + table	No	Discarded cards affect probabilities
CartPole	[x, v, theta, omega]	Yes	Position + velocity fully determine the physics
Autopilot (camera)	Current frame	No	A single frame does not show speed and direction

Which task does NOT satisfy the Markov property under a naive state definition?

Action space

Type	Example task	Actions	Dimensionality
Discrete	Atari Breakout	{up, down, left, right, fire}	\|A\| = 5
Discrete	CartPole	{left, right}	\|A\| = 2
Continuous	Robot arm	Rotation angles of 6 joints	A subset of R^6
Mixed	StarCraft	Unit selection (discrete) + coordinates (continuous)	Combination

Why is action masking preferable to penalizing invalid actions?

Transition function

For small discrete environments the transition function can be written as a **transition matrix**. For each action a - a matrix of size |S| x |S|, where element [i][j] = P(s_j | s_i, a).

In the stochastic Frozen Lake environment, the agent chooses the action 'right', but with probability 1/3 slides down. What describes this situation?

Discount factor and planning horizon

Gamma defines the agent's **effective planning horizon**. Rule of thumb: rewards beyond 1/(1-gamma) steps contribute negligibly.

gamma	Effective horizon	Steps until reward is less than 1%	Application
0.9	~10 steps	44 steps	Short episodes, fast games
0.95	~20 steps	90 steps	Medium-length tasks
0.99	~100 steps	459 steps	Long-term planning
0.999	~1000 steps	6905 steps	Very long episodes

MDP requires the agent to know the transition function P(s'|s,a) - without it RL is impossible

Most practical RL algorithms are model-free. They do not know P(s'|s,a) and learn directly from experience (by observing transitions s, a → s', r).

With gamma = 0.95, what is the approximate effective planning horizon of the agent?

Key ideas

**Markov property** - the future depends only on the current state, not history. AlphaGo does not remember all moves - only the current position
**State and action spaces** can be discrete (Q-table) or continuous (neural network). This determines the algorithm choice
**Transition function P(s'|s,a)** describes the environment's physics. Stochasticity is the norm: the real world is slippery, like Frozen Lake
**Discount factor gamma** - a mathematical necessity for finite reward sums. Planning horizon is approximately 1/(1-gamma)
**MDP = (S, A, P, R, gamma)** - the complete language for any RL task: from a gridworld to a tokamak
Sutton and Barto (1988) established TD-Learning and the MDP formalism, without which there would be no AlphaGo or ChatGPT RLHF
Model-free RL does not require knowing P(s'|s,a) - most production algorithms (PPO, SAC, DQN) are model-free

Вопросы для размышления

Model a typical working day as an MDP: what are the states, actions, transitions, and rewards? Is this task Markovian?
Why can gamma = 1.0 not be used for infinite episodes, but can be used for finite ones (e.g., a chess game)?
What real-world tasks are difficult to formalize as an MDP? Consider situations where the definition of state is ambiguous.

Связанные уроки

rl-01 — Intuitive concepts of agent, environment, and reward
rl-03 — Bellman equations - the next step after MDP formalization
prob-17 — Markov chains are the math behind the Markov property
prob-03-conditional — P(s'|s,a) is a conditional probability from probability theory
alg-21-dp — Dynamic programming solves MDPs via Bellman equations
prob-04-bayes — Bayesian inference is used in model-based RL

Markov Decision Processes (MDP)

Sutton and Barto: Temporal-Difference Learning

Цели урока

Предварительные знания

State space and the Markov property

Action space

Transition function

Discount factor and planning horizon

Key ideas

Related topics

Вопросы для размышления

Связанные уроки

Markov Decision Processes (MDP)

Sutton and Barto: Temporal-Difference Learning

Цели урока

Предварительные знания

State space and the Markov property

Action space

Transition function

Discount factor and planning horizon

Key ideas

Related topics

Вопросы для размышления

Связанные уроки