Automata and Cognition

Markov Decision Process

Цели урока

Understand how MDP differs from Markov chains: the agent influences transitions through actions
Know MDP components: S, A, P, R, gamma and the role of the discount factor
Understand the Bellman optimality equation and why it is solved iteratively
Distinguish Value Iteration, Policy Iteration and Q-Learning by applicability

Предварительные знания

Markov Chains (aut-02): transition matrices, stationary distributions
HMM (aut-03): hidden states, Viterbi algorithm
Basic probability and mathematical expectation

March 2016. Lee Sedol, 18-time world Go champion, sits across from a laptop running AlphaGo. After losing three straight games, he says: 'I thought I could always win at least one.' He loses the fourth too. The program had played 40 million games against itself, learning through a formal mathematical structure called a Markov Decision Process. MDP is the backbone of every RL system built since - DQN clearing Atari in 2013, AlphaStar defeating StarCraft pros in 2019, ChatGPT's RLHF training. Every token ChatGPT generates is an MDP step.

**AlphaGo (DeepMind, 2016)** - defeated world champion Lee Sedol 4:1, learning through 40M self-play games via MDP
**DQN (DeepMind, 2013)** - Q-Learning plus neural network, cleared 49 Atari games at human level without knowing the rules
**ChatGPT RLHF** - training via a reward model is MDP: model as agent, context as state, tokens as actions
**Tesla Autopilot** - real-time vehicle control through policy learned from MDP
**Recommendation systems (YouTube, Netflix)** - which content to show next: MDP where state is user history

Bellman and the curse of dimensionality

In 1957 Richard Bellman, working at RAND Corporation on optimisation problems for the US Air Force, published 'Dynamic Programming'. He described the recursive decomposition of optimisation tasks - the Bellman optimality principle - and himself named the central problem the 'curse of dimensionality': if a system has 10 parameters each taking 10 values, the state space is 10^10. This problem remains relevant today, and neural network function approximators in Deep RL are built precisely to fight it.

From observer to agent

**DeepMind AlphaGo, 2016: the program played 40 million games of Go against itself and defeated world champion Lee Sedol 4:1.** Markov chains and HMMs described the world and inferred hidden states. AlphaGo did something different - at every position it chose a move. That is the essence of MDP: not observation, but decision-making.

Model	Agent role	Example
Markov Chain	Passive - only observes	Weather changes on its own
HMM	Passive - infers the hidden state	Doctor diagnoses from symptoms
MDP	Active - chooses actions	AlphaGo picks a move in a position

**MDP (Markov Decision Process)** is a 5-tuple (S, A, P, R, gamma): S - set of states, A - set of actions, P(s'|s,a) - transition function (depends on the action!), R(s,a,s') - reward function, gamma - discount factor. In a Markov chain the transition P(s'|s) depends only on the state. In an MDP it depends on both the state and the chosen action.

At each step: the agent observes state s, picks action a, the environment transitions to s' with probability P(s'|s,a) and emits reward R(s,a,s'). The goal is to find a strategy for choosing actions that maximises total accumulated reward.

An MDP is just a Markov chain with a larger number of states

An MDP adds actions and rewards: the agent influences the trajectory and optimises total reward

In a Markov chain the trajectory is fully determined by the initial state and the transition matrix. In an MDP the agent chooses an action at every state, and that choice changes the transition probabilities. This is a fundamental difference - an optimisation problem instead of a description problem.

What distinguishes an MDP from a Markov chain?

Policy and state value

**A policy pi** is a function that determines which action to choose in each state. Deterministic policy: pi(s) = a. Stochastic policy: pi(a|s) = P(A=a|S=s). The goal of an MDP is to find the optimal policy pi* that maximises expected total reward.

Value Function - how good is a state

**Value Function V^pi(s)** - expected total discounted reward when starting from state s and following policy pi. The discount factor gamma sets the planning horizon: with gamma close to 0 the agent is greedy (takes immediate reward), with gamma close to 1 it is far-sighted (optimises long-term outcome).

gamma	Horizon	Example application
gamma = 0.1	Greedy, 1-2 steps ahead	Trading bot on second-level ticks
gamma = 0.9	Moderate, ~10 steps	Robot navigation in a maze
gamma = 0.99	Far-sighted, ~100 steps	AlphaGo (a game lasts 200+ moves)
gamma = 1.0	Infinite horizon	Only for episodic tasks with guaranteed termination

At gamma = 0.99 a reward 100 steps away weighs 0.99^100 = 0.37 of an immediate one. After 500 steps - 0.006. Gamma controls how far ahead the agent plans.

Two agents: gamma=0.1 and gamma=0.99. Paths: (A) immediate reward +5, then 0; (B) +20 in 2 steps. What does each choose?

The Bellman equation and Q-function

**Richard Bellman, RAND Corporation, 1957.** Solving optimisation problems for the US Air Force, he noticed a recursive structure: the value of a state is expressed through the values of subsequent states. This observation became the foundation of all reinforcement learning.

**Bellman optimality equation for V**: V*(s) = max_a SUM_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V*(s')]. Breakdown: choose the best action (max_a), average over possible transitions (SUM), collect immediate reward plus discounted value of the next state.

**Q-function Q^pi(s,a)** - the value of choosing a specific action a in state s and then following policy pi. Relation to V: V^pi(s) = SUM_a pi(a|s) * Q^pi(s,a). The optimal policy is trivial once Q* is known: pi*(s) = argmax_a Q*(s,a). That is why Q-Learning became the backbone of modern RL.

The Bellman equation is a system of |S| equations with |S| unknowns. V*(s) depends on V*(s'), which depends on V*(s''). There is no closed-form solution - iteration is required. Bellman himself called this problem the 'curse of dimensionality': with 10 features each taking 10 values the state space is 10^10.

The Bellman equation provides a formula for directly computing the optimal policy

The Bellman equation defines an optimality condition - a system of equations solved by iterative algorithms

V*(s) = max_a SUM P * [R + gamma * V*(s')] has V* on both sides. This is not a formula but an equation with unknown V*. It is proved that iterative application of the Bellman operator converges to the unique solution (contraction mapping theorem, gamma < 1).

Why is the Bellman equation solved iteratively rather than directly?

Algorithms for solving MDPs

**Three classical approaches to solving MDPs, each used in real systems.** Value Iteration underpins planning in robotics. Policy Iteration is used inside AlphaGo. Model-free Q-Learning is the heart of DQN by DeepMind (2013), which was the first to complete 49 Atari games at human level.

Algorithm	Requires model P, R?	When to use
Value Iteration	Yes (model-based)	Model is known, small state space
Policy Iteration	Yes (model-based)	Converges faster than VI, AlphaGo planning
Q-Learning	No (model-free)	Model unknown, agent learns from experience - Atari, robots

**Exploration vs Exploitation:** Q-Learning uses an epsilon-greedy strategy: with probability epsilon a random action is chosen (exploration), with probability 1-epsilon the best known action is chosen (exploitation). Without exploration the agent gets stuck in a local optimum. DeepMind DQN started with epsilon=1.0 and gradually decayed it to 0.1 over 1 million steps.

**MDP limitations:** 1. full observability - the agent knows the state exactly. When the agent is uncertain about the state a POMDP is needed. 2. Markov property - the future depends only on the current state. 3. discrete state and action spaces. For continuous spaces (robot control) neural networks approximate V and Q functions (Deep RL).

Q-Learning is used when the transition function P(s'|s,a) is unknown. How does it manage without it?

Вопросы для размышления

A delivery robot navigating a city: state - position, battery level, current order; actions - move in 4 directions, charge, accept order. How should the reward function R be designed so the robot does not sacrifice safety for delivery speed?

Связанные уроки

aut-03-hmm — HMM is the precursor to MDP: hidden states, but no agent actions
rl-01 — MDP is the mathematical foundation of all reinforcement learning
rl-04 — Dynamic programming - Value Iteration and Policy Iteration operate directly on MDPs
aut-05-pomdp — POMDP extends MDP to partial observability of state
ml-48-rl-intro — Practical MDP applications in modern RL with neural network approximators
alg-21-dp