Machine Learning
Introduction to Reinforcement Learning
In 2016, AlphaGo defeated the world champion at Go - a game where the number of possible positions exceeds the number of atoms in the universe. No one programmed a winning strategy. The AI learned it on its own by playing millions of games against itself, through trial and error. This is reinforcement learning.
- **Robotics** - Boston Dynamics robots and drones learn to walk, run, and fly through RL, adapting to uneven surfaces and wind gusts without explicit programming of each movement
- **RLHF for LLMs** - ChatGPT, Claude, and other language models use Reinforcement Learning from Human Feedback: the model generates responses, humans rate them, and the RL algorithm (PPO) trains the model to give better answers
- **Data center optimization** - DeepMind applied RL to manage cooling in Google's servers, reducing energy consumption by 40%, saving millions of dollars and tons of CO2
Предварительные знания
- Probability basics: expected value, conditional probability, random variables
- Comfort with functions and recursion, since the Bellman equation is defined recursively
- Neural networks at a high level: enough to follow how Deep RL approximates value functions
The long road to learning by reward
Reinforcement learning did not start with neural networks. Its mathematical spine came from Richard Bellman, who in 1957 formalized dynamic programming and the Markov decision process (MDP), giving us the recursive value equation that still carries his name. The same year he coined the phrase 'curse of dimensionality' for why these methods explode in large state spaces. In 1961 Marvin Minsky, in 'Steps Toward Artificial Intelligence', connected the dots between trial-and-error learning, credit assignment, and intelligent behavior, framing problems RL would spend decades solving. The modern field crystallized around Richard Sutton and Andrew Barto. Sutton's 1988 work on temporal-difference (TD) learning showed how an agent could learn from raw experience by bootstrapping, updating estimates from later estimates without waiting for a final outcome. Their 1998 textbook 'Reinforcement Learning: An Introduction' (substantially revised in 2018) became the field's canonical reference. So when AlphaGo stunned the world in 2016, it stood on theory that was nearly sixty years in the making.
Agent and Environment
Reinforcement Learning (RL) is the third paradigm of machine learning, after supervised and unsupervised learning. In supervised learning, a model learns from labeled examples: "here's a picture of a cat, here's a dog." In unsupervised learning, it finds structure in unlabeled data. In RL, an **agent** interacts with an **environment**: it takes **actions**, observes the **state** of the environment, and receives **rewards**. The agent's goal is to learn to act in a way that **maximizes the total reward over the entire episode**.
Formally, the agent-environment interaction is described by a **Markov Decision Process (MDP)**. An MDP is defined by the tuple (S, A, P, R, gamma), where S is the state space, A is the action space, P(s'|s, a) is the transition probability to state s' given action a in state s, R(s, a) is the reward function, and gamma is the discount factor. The **Markov property** means that the future depends only on the current state, not on the full history: P(s_{t+1} | s_t, a_t) contains all the information needed for prediction.
In practice, RL environments are implemented using the **Gymnasium** library (formerly OpenAI Gym). Gymnasium provides a unified interface for hundreds of environments: from simple tasks (CartPole - balance a pole vertically, FrozenLake - navigate across ice) to complex ones (Atari games, robotics in MuJoCo). The interface is straightforward: `env.reset()` initializes the environment and returns the initial state, `env.step(action)` executes an action and returns the new state, reward, and a termination flag.
**Key differences between RL and Supervised Learning:** - **No teacher.** The agent receives no correct answers. It only gets a numeric reward and must figure out which actions led to it. - **Delayed reward.** An action may yield a reward 100 steps later. A chess move in the opening can determine the outcome 50 moves ahead. - **Sequential decisions.** The current action affects future states. In supervised learning, examples are independent. - **Exploration vs exploitation dilemma.** The agent must balance exploring new actions (exploration) with using already known good actions (exploitation).
What is the essence of the Markov property in the context of MDPs?
Reward and Return
At each step the agent receives an **immediate reward** r_t - a scalar number indicating how good the current action was. But the agent optimizes not individual rewards but the **total return** G_t - the accumulated reward from time t to the end of the episode. The key idea: future rewards are more valuable if they are closer in time. This is handled by the **discount factor gamma** (0 <= gamma <= 1).
Why discount? Three reasons. **Uncertainty:** the future is less predictable - a reward 100 steps away may never arrive. **Mathematical correctness:** without discounting (gamma=1) an infinite sum of rewards may diverge. **Time preference:** a reward now is objectively more valuable because the agent can use it sooner. Useful recursive property: G_t = r_t + gamma * G_{t+1}. This allows computing the return step-by-step without waiting for the episode to end.
Designing the reward function - **reward shaping** - is one of the hardest parts of RL. Rewards can be **dense** (at every step) or **sparse** (only at the end or upon reaching a goal). Dense rewards ease learning: the agent gets constant feedback. Sparse rewards are closer to real-world tasks, but the agent wanders blindly for a long time until it accidentally reaches the goal. Example: in chess, a sparse reward (+1 for win, -1 for loss) provides a signal only at the end of an 80-move game.
**Typical gamma values:** - **gamma = 0** - the agent is fully greedy, optimizing only the immediate reward r_t. No planning ahead. - **gamma = 0.9** - standard choice for short-horizon tasks (tens of steps). A reward 10 steps away = 0.35 of its current value. - **gamma = 0.99** - for long-horizon tasks (hundreds of steps). A reward 100 steps away = 0.37 of its current value. Used in Atari games. - **gamma = 0.999** - for very long episodes (thousands of steps). More stable training, but slower. - **gamma = 1** - no discounting. Works only if episodes are always finite (otherwise G_t can be infinite).
An episode lasts 4 steps with rewards [0, 0, 5, 0]. With gamma = 0.5, what is the return G_0 (from the start of the episode)?
Policy
A **policy** is the agent's strategy - the rule by which it selects actions. Formally, a policy is a mapping from states to actions. Two types exist: a **deterministic** policy a = mu(s) always chooses one action in a given state, while a **stochastic** policy pi(a|s) defines a probability distribution over actions. The goal of RL is to find the **optimal policy** pi* that maximizes the expected total return.
The central dilemma of RL is **exploration vs exploitation**. Exploitation means using the best known action to maximize reward right now. Exploration means trying new actions that might lead to even greater rewards in the future. If the agent only exploits, it gets stuck on a locally good strategy without finding a globally better one. If it only explores, it never takes advantage of what it has learned.
**Epsilon-greedy** is the simplest strategy for balancing exploration and exploitation. With probability (1-epsilon) the agent chooses the best known action (exploitation), and with probability epsilon it chooses a random action (exploration). Typical value: epsilon = 0.1 (10% exploration). Often epsilon is **annealed** over time: at the start of training epsilon = 1.0 (pure exploration), gradually decreasing to epsilon = 0.01 (nearly pure exploitation). This way the agent first explores the environment, then exploits its knowledge.
**Softmax exploration - an alternative to epsilon-greedy:** Instead of choosing a random action with a fixed probability, softmax assigns each action a probability proportional to its value: pi(a|s) = exp(Q(s,a) / tau) / SUM_b exp(Q(s,b) / tau) - **High tau (temperature)** - all actions are nearly equally likely (exploration) - **Low tau** - the best action dominates (exploitation) - **tau -> 0** - degenerates to greedy (always the best action) Advantage over epsilon-greedy: softmax accounts for *how much* one action is better than another. Epsilon-greedy picks a random action uniformly, even if one of them is directly bad.
In tabular methods (when the state and action spaces are small) the policy is stored as a table: for each state - probabilities over actions. In real tasks there are too many states (Atari - 256^(210*160*3) pixels), so the policy is approximated by a neural network: pi_theta(a|s), where theta are the network weights. This is called a **policy-based** method. The alternative is a **value-based** method, where the policy is derived from the value function (covered next).
An agent uses epsilon-greedy with epsilon = 0.2. It has 5 possible actions, and it estimates that action #3 is the best. What is the probability that the agent selects action #3?
Value Function
The **state value function** V_pi(s) is the expected return when the agent starts from state s and follows policy pi. Formally: V_pi(s) = E_pi[G_t | s_t = s]. It answers the question: "how good is it to be in this state?". A high V(s) means the agent expects to accumulate a lot of reward in the future from this state. The **Q-function** (action-value function) Q_pi(s, a) is the expected return when the agent is in state s, takes action a, and then follows policy pi.
The central formula of RL - the **Bellman equation** - links the value of the current state to the values of successor states. For V: V_pi(s) = SUM_a pi(a|s) * SUM_{s'} P(s'|s,a) * [R(s,a) + gamma * V_pi(s')]. Intuitively: the value of a state = expected immediate reward + discounted value of the next state. This is a recursive definition: V is expressed in terms of V. The Bellman equation is the foundation for computing V and Q via dynamic programming.
There are two classic algorithms based on the Bellman equation. **Value Iteration** iteratively updates V(s) for all states until the values converge. The optimal policy is then extracted from V*: pi*(s) = argmax_a [R(s,a) + gamma * V*(s')]. **Policy Iteration** alternates between two steps: 1. policy evaluation - computing V_pi for the current policy 2. policy improvement - greedily updating the policy based on V_pi. Policy Iteration converges in fewer iterations, but each iteration is more expensive.
**Value Iteration vs Policy Iteration:** - **Value Iteration:** updates V at each iteration until convergence. Each iteration is O(|S|^2 * |A|). May require hundreds of iterations, but each is fast. - **Policy Iteration:** at each iteration fully computes V_pi (solving a system of |S| linear equations), then improves the policy. Usually converges in 3-10 iterations, but each is more expensive. - Both require knowledge of the environment model (P and R) - this is a **model-based** approach. - When the model is unknown (the agent doesn't know P and R, and can only interact with the environment) - **model-free** methods are needed: Q-learning, SARSA, Policy Gradient. These are covered in the next lessons.
The value function is the bridge between the RL problem and its solution. Knowing the optimal Q*(s,a), the optimal policy follows directly: always choose the action with the highest Q*. That's why most RL algorithms (Q-learning, DQN, DDPG) focus on *estimating* the value function. Modern methods use neural networks to approximate Q(s,a) - this is **Deep RL**, which enabled solving Atari games, defeating world champions at Go, and teaching robots to walk.
Reinforcement Learning is about models playing video games
Games are just a convenient testbed. RL is used in robotics (controlling robots and drones), recommendation systems, portfolio optimization, traffic management, data center cooling (Google reduced energy consumption by 40%), and RLHF for training LLMs (ChatGPT, Claude)
Games became the icon of RL due to striking achievements: Atari (DQN), Go (AlphaGo), StarCraft (AlphaStar), Dota 2 (OpenAI Five). But these are merely benchmarks - environments with clear rewards, fast simulation, and well-defined rules, ideal for testing algorithms. In reality, RL is applied wherever sequential decision-making with feedback exists: from controlling plasma in nuclear fusion reactors (DeepMind) to optimizing compilers (MLIR).
Key Takeaways
- **Agent and environment:** RL is built on an interaction loop - the agent observes a state, takes an action, receives a reward, and the environment transitions to a new state. The MDP formalizes this via (S, A, P, R, gamma) with the Markov property
- **Return and discounting:** the agent maximizes not the immediate reward but the discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2}... Gamma determines how far the agent "looks" into the future. Reward shaping - designing the reward signal - critically influences training success
- **Policy:** the agent's strategy pi(a|s) - a mapping from states to actions. Epsilon-greedy and softmax balance exploration (discovering new things) and exploitation (using what is known). Without exploration the agent gets stuck at a local optimum
- **Value function and Bellman equation:** V(s) and Q(s,a) estimate the expected return, and the Bellman equation recursively links the values of neighboring states. Knowing Q* - the optimal Q-function - the agent simply takes the argmax, just as AlphaGo chose moves that beat the world champion
Related Topics
Reinforcement Learning opens up a whole family of algorithms for learning through interaction with an environment:
- Q-Learning — The first model-free RL algorithm: learns Q(s,a) without knowing the environment model, using the Bellman equation and epsilon-greedy exploration. Tabular Q-learning for small environments, Deep Q-Network (DQN) for complex tasks with neural network approximation
- Policy Gradient — An alternative approach: instead of estimating the Q-function, directly optimizes the parameters of the policy pi_theta(a|s) via gradient ascent. REINFORCE, Actor-Critic, and PPO are the key algorithms used in RLHF for training LLMs
Вопросы для размышления
- Why is exploration necessary for finding the optimal policy? Suppose an agent in a maze has found a path with reward +5. How would it discover a path with reward +100 through a different corridor without exploration?
- How does the choice of gamma affect the agent's behavior? In which tasks would you want gamma close to 1 (chess, investing), and in which would gamma = 0.5 suffice (a game with rapid feedback)?
- Reward shaping can speed up learning, but a poorly designed reward leads to undesirable behavior. How would you design a reward for a robot vacuum cleaner? What pitfalls might arise?
Связанные уроки
- ml-47-model-monitoring — Continues the ML curriculum sequence
- ml-49-q-learning — Value methods build on RL basics
- ml-50-policy-gradient — Policy methods build on RL basics
- prob-04-bayes — MDP transitions are conditional probabilities
- alg-21-dp — Bellman equations use dynamic programming
- aie-17-agent-fundamentals
- aie-19-multi-agent