Machine Learning

Q-Learning and Deep Q-Network

In 2013, a small London startup called DeepMind showed that a single algorithm could learn to play 49 different Atari games at superhuman level - using only raw pixels as input. Google acquired the company for $500 million. The algorithm was called DQN - Deep Q-Network. We break it down piece by piece: from a simple Q-value table to neural network approximation that transformed the entire field of reinforcement learning.

  • **Games and simulations** - DQN and its variants play Atari, Go, StarCraft II, and Dota 2. OpenAI Five defeated world champions in Dota 2, using Q-Learning variants with experience replay to train on thousands of years of simulated gameplay
  • **Data center management** - Google uses DQN to optimize server cooling, reducing energy consumption by 40%. The agent determines optimal air conditioning settings where state = temperatures and loads, and actions = cooling parameters
  • **Recommendation systems** - Q-Learning is applied to sequential recommendations: state = user's action history, action = which content to show next. YouTube and Netflix experiment with RL approaches that maximize long-term engagement instead of single-step clicks

Предварительные знания

  • Introduction to Reinforcement Learning

From a PhD thesis to superhuman Atari

Q-learning has an unusually clean origin story. In 1989 Chris Watkins introduced it in his Cambridge PhD thesis, 'Learning from Delayed Rewards'. His insight was that an agent could learn the optimal action-value function Q* directly from experience, without ever building a model of the environment, by repeatedly nudging each estimate toward the reward plus the discounted best next value. It was model-free and off-policy, meaning the agent could learn the optimal policy even while behaving randomly to explore. The method was elegant but came without a proof that it actually converged. That gap closed in 1992, when Watkins and Peter Dayan published a rigorous convergence proof, showing that under mild conditions the Q-values are guaranteed to reach the optimum. For two decades Q-learning stayed confined to small, tabular problems, since a lookup table cannot scale to images. The breakthrough came from DeepMind: their 2013 Deep Q-Network (DQN) paper, expanded in a 2015 Nature article, replaced the table with a convolutional neural network and added experience replay and a target network for stability. DQN learned to play dozens of Atari games from raw pixels, reaching superhuman scores on many of them and reigniting the entire field of deep reinforcement learning.

Q-Table

In the previous lesson we learned that an agent receives rewards for actions in an environment. But how does it know which action to take in each state? For that, it needs to estimate the **value** of every (state, action) pair. That's exactly what a **Q-table** stores - a two-dimensional table where rows represent states, columns represent actions, and each value Q(s, a) indicates the expected total reward when action a is taken in state s and then the agent acts optimally.

At the start of training the Q-table is filled with zeros - the agent knows nothing about the environment. It begins **exploring**, taking random actions. After each transition (state, action, reward, next_state) the agent updates the corresponding table cell. Gradually the Q-values converge to their true values, and the agent begins **exploiting** its accumulated knowledge by choosing the action with the highest Q.

**Epsilon-greedy strategy - balancing explore vs exploit:** - With probability epsilon: random action (exploration) - With probability 1 - epsilon: action with the highest Q (exploitation) Typical schedule: epsilon starts at 1.0 (100% random actions) and gradually decreases to 0.01-0.1. At the beginning the agent explores the environment; by the end it uses the learned strategy. Without exploration the agent can get stuck in a local optimum: it finds one reward and keeps repeating the same actions, unaware that a larger reward is nearby.

A Q-table works well for environments with a small number of states and actions: FrozenLake (16 states), Taxi-v3 (500 states), simple grid worlds. But what if there are billions of states? In the Atari game Breakout each frame is a 210 x 160 pixel image with 128 possible colors. The number of unique states: 128^(210*160) - a number so large that all the atoms in the universe wouldn't be enough to store such a table. This is precisely where neural network approximation comes in.

An agent is training in an environment with 100 states and 5 actions. What size Q-table does it need, and what is stored in each cell?

Bellman Equation

How exactly are Q-table values updated? Behind this lies the **Bellman equation** - a fundamental recursive relation that links the value of the current state to the value of the next. The idea: the value of action a in state s is the sum of the immediate reward r and the discounted value of the best action in the next state s'. Formally: **Q(s, a) = r + gamma * max_a' Q(s', a')**.

We don't know the true Q-values upfront - we **learn** them. So we use a **temporal difference (TD) update**: at each step we shift the current value Q(s, a) toward a new estimate. The parameter alpha (learning rate) controls how much each update changes the table. Update formula: **Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a))**, where target = r + gamma * max Q(s', a').

**Q-Learning convergence conditions:** Q-Learning is guaranteed to converge to optimal Q-values if: 1. **Every (s, a) pair is visited infinitely many times** - this is why exploration (epsilon-greedy) is needed 2. **Learning rate decreases over time**, but not too fast: sum of alpha_t = infinity, sum of alpha_t^2 < infinity 3. **The environment is stationary** - the rules of the game don't change In practice a fixed alpha = 0.001-0.1 is often used, and Q-Learning converges well enough, though without formal guarantees.

The Bellman equation is more than just an update formula. It is the **principle of optimality**: an optimal strategy consists of optimal sub-strategies. If you know the optimal Q-values for all successor states, you can compute the optimal Q for the current state. Q-Learning applies this principle iteratively: starting from zeros, each update brings the table closer to the true optimal values, propagating reward information backward along the chain of states.

Q(s, a) = 2.0, the agent received reward = 1.0 and ended up in s' where max Q(s', a') = 3.0. With alpha = 0.5 and gamma = 0.9, what will Q(s, a) be after the update?

Deep Q-Network (DQN)

A Q-table works for FrozenLake with 16 states. But in the Atari game Breakout the state is a screen image: 210 x 160 pixels. Storing a Q-value for every unique frame in a table is impossible - the table would need to be larger than the number of atoms in the universe. DeepMind's solution (2013): **replace the table with a neural network**. Instead of explicitly storing Q(s, a) for every state, the neural network is *trained to predict* Q-values for any input state. That is **Deep Q-Network (DQN)**.

The key difference between DQN and plain Q-Learning: the neural network **generalizes** - similar states receive similar Q-values. If the agent learned that a ball in the upper-right corner requires moving right, it will apply that knowledge when the ball is slightly to the left. A Q-table cannot generalize: each state is independent. This ability to generalize is precisely what allowed a single DQN architecture to master 49 Atari games.

**DeepMind and the DQN breakthrough:** **2013** - paper "Playing Atari with Deep Reinforcement Learning". A single neural network learned to play 7 Atari games at superhuman level. Input: raw pixels. Output: Q-values for each action (left, right, fire, etc.). **2015** - Nature paper "Human-level control through deep reinforcement learning". 49 Atari games, DQN surpassed a professional human player in 29 of them. Google acquired DeepMind for $500 million. Two key innovations that made training stable: 1. **Experience replay** - storing and randomly sampling transitions 2. **Target network** - a separate network for computing target values

Why is a **target network** - a second copy of the neural network - needed? Consider what happens when you update Q(s, a) toward target = r + gamma * max Q(s', a'). But Q(s', a') is computed by the very same network you are updating! This means the target constantly "runs away" - you are chasing a moving goal. This leads to instability and divergence. Solution: freeze a copy of the network (target network) and update it every N steps. The main network keeps training while the target network stays fixed, providing stable target values.

Why does DQN use a target network - a separate copy of the neural network updated less frequently than the main one?

Experience Replay

Neural networks train on **mini-batches** - random subsets of data. But in RL, data arrives sequentially: s1->s2->s3->... Training on consecutive transitions means adjacent examples are highly correlated (state 5 is similar to state 6), and the network will forget old experience by overfitting to recent data. **Experience replay** solves this problem: all transitions (s, a, r, s') are stored in a memory buffer, and **random mini-batches** are sampled from the buffer for training.

Experience replay provides three key benefits. First: **breaking correlations** between consecutive samples - random sampling makes data i.i.d. (independent and identically distributed), which is required for stable neural network training. Second: **data efficiency** - each transition can be used for training multiple times, not just once. Third: **stabilized training** - the network sees diverse experience from different parts of the environment, not just the current region.

**Prioritized Experience Replay (PER):** A standard replay buffer samples transitions uniformly at random. But some transitions are more "informative" - those with a large TD error (the difference between predicted and actual Q). The agent learns more from surprising events. **PER** assigns each transition a priority proportional to its TD error: - Large TD error = the transition surprised the agent = high priority - Small TD error = the agent predicts this well = low priority PER speeds up DQN training by roughly 2x on Atari games. However, it requires an additional data structure (sum tree) for efficient priority-based sampling.

DQN with experience replay and a target network became the foundation of modern deep RL. But it has an important limitation: DQN only works with a **discrete action space** - left, right, fire. For continuous actions (turn the steering wheel by 17.3 degrees, set motor power to 0.73) other methods are needed: policy gradient and actor-critic algorithms, covered in the next lesson. DQN selects the best action via argmax over Q-values - this is impossible when there are infinitely many actions.

DQN can solve any RL task since neural networks are universal approximators

DQN only works with discrete actions (left, right, fire). For continuous actions (steering angle, force applied) policy gradient methods are needed (DDPG, SAC, PPO)

DQN selects actions via argmax over Q-values: a* = argmax_a Q(s, a). This is only possible when the number of actions is finite and all options can be enumerated. For continuous actions (e.g., steering angle from -30 to +30 degrees) argmax over an infinite set is impossible. Policy gradient methods solve this: instead of a Q-table or Q-network, they train a policy network that directly outputs an action for a given state.

Why can't DQN train the neural network on sequential transitions (s1->s2->s3->...) and why is experience replay needed?

Key Takeaways

  • **Q-table:** a 2D table Q(s, a) where each cell stores the expected total reward for taking action a in state s - epsilon-greedy balances exploration and exploitation, and the table works for small state spaces (FrozenLake, Taxi)
  • **Bellman equation:** Q(s, a) = r + gamma * max Q(s', a') - the core recursion linking the value of the current action to the immediate reward and the best value of the next state; TD update shifts the Q-value toward the target by step size alpha
  • **Deep Q-Network:** when the state space is too large for a table (billions of pixels in Atari), a neural network approximates the Q-function and generalizes to new states - the target network stabilizes training by fixing the update targets
  • **Experience replay:** a buffer stores transitions (s, a, r, s'), random mini-batch sampling breaks the correlation between sequential data - these ideas are exactly what allowed that London startup DeepMind to build an algorithm that mastered 49 Atari games and was acquired by Google for half a billion dollars

Related Topics

Q-Learning sits at the intersection of reinforcement learning and deep learning, connecting optimal control theory with neural network approximation:

  • Introduction to Reinforcement Learning — The foundation of Q-Learning: the concepts of agent, environment, state, action, reward, policy, MDP - all necessary for understanding the Q-function and the Bellman equation. Q-Learning concretizes the abstract RL problem into a table update algorithm
  • Policy Gradient — An alternative RL approach that addresses the main limitation of DQN: instead of Q-values, a policy network is trained that directly outputs actions. Works with continuous action spaces (robotics, control), where argmax over Q is impossible
  • Neural Networks — DQN uses a neural network as a Q-function approximator: architecture (fully connected or CNN), backpropagation, loss function (MSE between Q-prediction and target) - all these concepts from the neural networks lesson are applied in DQN
  • CNN — In the original DQN architecture for Atari, the input is screen pixels (84x84x4) processed by convolutional layers for visual feature extraction. CNNs allow DQN to work with raw images instead of hand-engineered features

Вопросы для размышления

  • Q-Learning is guaranteed to converge in the tabular case. But DQN uses a neural network approximator - and that guarantee disappears. Which mechanisms (target network, experience replay) help stabilize DQN training, and are they sufficient?
  • Experience replay stores old transitions and reuses them. But the agent's policy changes over time - transitions collected by an old policy may not reflect the current one. This is the off-policy learning problem. How does DQN handle it, and in which cases can this become critical?
  • DQN selects actions via argmax - the maximum Q-value. But what if two actions have Q = 10.0 and Q = 9.99? The agent always picks the first, even though the second is nearly as good. How does this relate to overestimation bias, and which DQN modifications (Double DQN) address this problem?

Связанные уроки

  • ml-48-rl-intro — Q-learning builds on MDP and value basics
  • ml-50-policy-gradient — Value-based vs policy-based RL
  • ml-25-neural-networks — DQN approximates Q with a neural net
  • ml-29-cnn — DQN reads pixels through CNNs
  • prob-13-clt — Experience replay stabilizes value estimates
  • rl-06
Q-Learning and Deep Q-Network

0

1

Sign In