Machine Learning
Policy Gradient
ChatGPT was not trained only on text - the final critical step was called RLHF (Reinforcement Learning from Human Feedback), where a policy gradient algorithm (PPO) learned to generate responses that humans prefer. The same family of algorithms teaches robots to walk, drones to fly, and autonomous vehicles to drive.
- **RLHF for language models** - PPO fine-tunes ChatGPT, Claude, and other LLMs to generate responses aligned with human preferences: safe, helpful, and honest, turning a raw language model into a useful assistant
- **Robotics** - SAC and TD3 train robot arms to grasp objects and robotic dogs to walk on uneven surfaces, and researchers are exploring them for the precise movements required of surgical robots
- **Games and strategy** - PPO trained OpenAI Five to beat the reigning world champions at Dota 2, and DeepMind's AlphaStar reached Grandmaster level at StarCraft II with a related actor-critic method, showing that policy gradient agents can match or beat top human players in complex strategy games with incomplete information
Prerequisites
From REINFORCE to PPO: how policy gradient became the engine of RLHF
In 1992 Ronald J. Williams published the REINFORCE algorithm, giving reinforcement learning a way to optimize a parameterized policy directly through the log-probability trick instead of learning value tables. The idea was elegant but suffered from high variance, so the field added a critic: actor-critic methods estimated a baseline value function to cut the noise in the gradient. The next leap came in 2015, when John Schulman and colleagues introduced TRPO (Trust Region Policy Optimization), which guaranteed stable, near-monotonic improvement by constraining each update with a KL-divergence trust region. TRPO worked but was heavy to implement. In 2017 Schulman and his team at OpenAI published PPO (Proximal Policy Optimization), replacing the hard trust region with a simple clipped objective. PPO was almost as stable and far easier to code, and it became the default policy gradient method across robotics, game-playing agents, and eventually RLHF, the step that turned raw language models into aligned assistants like ChatGPT.
REINFORCE: Direct Policy Optimization
In Q-learning we learned the value function Q(s, a) and derived the policy from it (choose the action with the highest Q). **Policy gradient** takes the opposite approach: we **directly parameterize the policy** pi(a|s; theta) - a neural network that takes a state and outputs action probabilities. The goal is to find parameters theta such that the policy achieves the maximum total reward. It's like the difference between rating every restaurant by score (Q-learning) and directly learning to choose a restaurant you'll enjoy (policy gradient).
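To make "a parameterized policy that outputs action probabilities" concrete, here is a minimal sketch that uses a linear model in place of a neural network. The names `policy_probs` and `sample_action`, the weight layout (one row per action), and the example numbers are illustrative assumptions, not part of the lesson.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(state, theta):
    """pi(a|s; theta) for a linear policy: one weight row per action."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)

def sample_action(state, theta, rng=random):
    """Draw an action index according to pi(.|s; theta)."""
    probs = policy_probs(state, theta)
    return rng.choices(range(len(probs)), weights=probs)[0]

# Hypothetical example: 2 state features, 3 actions.
theta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
state = [1.0, 0.0]
probs = policy_probs(state, theta)
action = sample_action(state, theta, random.Random(0))
```

Because the policy is stochastic, changing theta shifts probability mass smoothly between actions, which is what makes gradient-based training possible.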
The key idea behind REINFORCE is the **log-probability trick** (also known as the score function estimator). We cannot directly differentiate the reward with respect to the network parameters (the reward comes from the environment, not our model). But we can write the gradient of the expected reward via log-probability: **grad J(theta) = E[sum_t grad log pi(a_t|s_t; theta) * G_t]**, where G_t is the return from step t onward. Intuition: if an episode produced a large return, increase the probabilities of all actions chosen during that episode. If the return was small, decrease them.
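The log-probability trick follows from a short derivation. The sketch below uses notation that is assumed here rather than taken from the lesson: tau is a full trajectory, p_theta(tau) is its probability under the policy, and R(tau) is its total reward.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
   = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
   = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right], \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta).
\end{aligned}
```

The second line uses the identity grad p = p * grad log p, which turns the integral back into an expectation that can be estimated from sampled episodes. The last line holds because the environment's transition probabilities do not depend on theta, so only the policy terms survive the gradient. Replacing R(tau) with the return from step t onward leaves the expectation unchanged, since an action cannot influence rewards received before it.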
**Policy Gradient Theorem (simplified):**
grad J(theta) = E[ sum_t grad log pi(a_t | s_t; theta) * G_t ]
Where:
- J(theta) - expected total reward (what we maximize)
- pi(a_t | s_t; theta) - probability of action a_t in state s_t
- G_t - return (total reward from step t to the end of the episode)
- grad log pi - direction in which the probability of the action increases
**REINFORCE algorithm:**
1. Play a full episode, recording (s_t, a_t, r_t)
2. For each step t compute return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
3. Update theta: theta += alpha * grad log pi(a_t|s_t) * G_t
4. Repeat
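The four steps can be sketched on a deliberately tiny problem. The example below assumes a hypothetical two-action task with one-step episodes (action 1 always pays 1, action 0 pays 0) and a softmax policy over two logits, so the gradient of log pi can be written by hand. It is an illustration under those assumptions, not a general implementation.

```python
import math
import random

def discounted_returns(rewards, gamma):
    """Step 2: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """REINFORCE on a hypothetical 2-action task with one-step episodes.

    For a softmax over logits theta, the gradient of log pi(a) with
    respect to theta[k] is (1 if k == a else 0) - pi[k].
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(theta)
        # Step 1: play an episode (here a single action).
        action = rng.choices([0, 1], weights=probs)[0]
        rewards = [1.0 if action == 1 else 0.0]
        # Step 2: compute the return G_0.
        g = discounted_returns(rewards, gamma=0.99)[0]
        # Step 3: gradient ascent on log pi(a) * G.
        for k in range(2):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * grad_log_pi * g
    return softmax(theta)

final_probs = reinforce_bandit()  # probability mass should shift to action 1
```

In a real task the episode has many steps, the policy is a neural network, and an automatic differentiation library computes grad log pi instead of the hand-written expression.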
The main problem with REINFORCE is **high variance**. Because it uses a Monte Carlo estimate over a full episode, G_t can vary widely from episode to episode: one episode may yield G=200, another G=50, and the gradients jump around chaotically. A simple trick to reduce variance is **baseline subtraction**: subtract the mean return so that better-than-average episodes receive a positive signal and worse-than-average ones a negative signal. Normalizing returns (subtracting the mean and dividing by the standard deviation), a common step in REINFORCE implementations, is baseline subtraction plus a rescaling.
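Mean subtraction followed by rescaling can be sketched in a few lines. The function name `normalize_returns` and the small `eps` guard against division by zero are assumptions for illustration.

```python
import math

def normalize_returns(returns, eps=1e-8):
    """Subtract the mean return (the baseline) and divide by the std."""
    mean = sum(returns) / len(returns)
    var = sum((g - mean) ** 2 for g in returns) / len(returns)
    std = math.sqrt(var)
    return [(g - mean) / (std + eps) for g in returns]

# The two episodes from the text: G=200 and G=50.
normalized = normalize_returns([200.0, 50.0])
```

After normalization the better episode gets a signal close to +1 and the worse one close to -1, so the update pushes toward the first and away from the second instead of reinforcing both.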
Why does REINFORCE update parameters in the direction of grad log pi(a|s) * G rather than simply grad pi(a|s) * G?
Advantage Actor-Critic (A2C)
REINFORCE waits until the end of an episode to compute the return G_t. This has two drawbacks: 1. high variance (one unlucky long episode can dominate the update) 2. no way to learn in tasks without a clear end (continuing tasks). **Actor-Critic** addresses both by splitting the model into two parts: the **Actor** (policy pi) selects actions, and the **Critic** (value function V(s)) estimates how good the current state is.
The key concept is the **Advantage** A(s, a) = Q(s, a) - V(s). Advantage shows how much a specific action is **better than average** in a given state. If the advantage is positive - the action is better than usual, increase its probability. If negative - worse than average, decrease it. In practice, advantage is estimated via the TD error: **A(s, a) = r + gamma * V(s') - V(s)**, where V(s) is the Critic's output.
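The TD-error estimate of the advantage is a one-line computation. In the sketch below the function name and the `done` flag (which drops the bootstrap term when s' is terminal) are assumptions added for illustration. The values would normally come from the Critic network.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # Assumption: a terminal next state contributes no future value.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# Critic thought the state was worth 2.0; we got reward 1.0 and landed
# in a state worth 3.0, so the action turned out better than expected.
adv = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.9)
```

The same quantity, r + gamma * V(s'), also serves as the regression target when training the Critic.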
**Why Advantage reduces variance:**
REINFORCE: grad J = E[grad log pi(a|s) * G_t]
- G_t includes ALL future rewards (sum from t to end)
- G_t is highly noisy (lots of randomness)
Actor-Critic: grad J = E[grad log pi(a|s) * A(s, a)]
- A(s, a) = r + gamma * V(s') - V(s)
- Uses only ONE-step reward + Critic's estimate
- V(s) is a trained neural network - a smooth estimate
- Variance is significantly lower
**Trade-off (bias-variance tradeoff):**
- REINFORCE: unbiased (no bias), high variance
- Actor-Critic: biased (Critic can make errors), low variance
- In practice, low variance matters more than unbiasedness
A2C (Advantage Actor-Critic) is the **synchronous** version of the algorithm: several parallel environments collect data simultaneously, then all gradients are averaged and applied in a single step. This is more efficient than a single agent in a single environment because parallel environments provide more diverse experience and stabilize training. A2C is the workhorse of policy gradient: straightforward, easy to understand, and effective enough for many tasks.
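The synchronous update can be sketched as averaging per-environment gradients before a single parameter step. The function names and the flat-list representation of parameters are illustrative assumptions. A real implementation would batch the environments through one network and let the framework do the averaging.

```python
def average_gradients(per_env_grads):
    """Average gradient vectors collected from parallel environments."""
    n_envs = len(per_env_grads)
    return [sum(component) / n_envs for component in zip(*per_env_grads)]

def a2c_step(theta, per_env_grads, alpha=0.01):
    """Apply one gradient-ascent step using the averaged gradient."""
    avg = average_gradients(per_env_grads)
    return [p + alpha * g for p, g in zip(theta, avg)]

# Two hypothetical environments, two parameters.
grads = [[1.0, 2.0], [3.0, 4.0]]
new_theta = a2c_step([0.0, 0.0], grads, alpha=0.5)
```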
Advantage A(s, a) = Q(s, a) - V(s). What does a negative advantage for a specific action mean?
PPO - Proximal Policy Optimization
REINFORCE and A2C make **one update step** per batch of data and then discard it. This is wasteful - data took a long time to collect but is used only once. We'd like to make multiple optimization steps on the same batch. But if we update the policy too aggressively, the new policy will be far from the one that collected the data, making updates incorrect. **PPO (Proximal Policy Optimization)** solves this elegantly: it constrains the update size through a **clipped objective**.
**The problem PPO solves:**
In policy gradient, data is collected by the old policy pi_old. We want to update to a new policy pi_new. If pi_new is too far from pi_old:
- The data is irrelevant (collected by a different policy)
- The update can catastrophically degrade the model
- Training becomes unstable
**TRPO (Trust Region Policy Optimization)** - predecessor:
- Added the constraint KL(pi_old || pi_new) < delta
- Guaranteed monotonic improvement
- But complex to implement (conjugate gradient, line search)
**PPO** - a simplification of TRPO:
- Instead of a hard constraint - clipping in the loss function
- Nearly as stable, but much simpler
- 2017, OpenAI - and has been the industry standard ever since
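The clipped objective idea: compute the probability ratio **r(theta) = pi_new(a|s) / pi_old(a|s)**. If r = 1, the policy has not changed. If r = 2, the new policy makes this action twice as likely. PPO clips this ratio to the range [1 - epsilon, 1 + epsilon], typically epsilon = 0.2. Once the ratio leaves this range in the direction the advantage favors, the objective stops rewarding further change, so the optimizer has no incentive to shift an action's probability by much more than about 20% on a given batch. Clipping is a soft limit that removes the incentive, not a hard guarantee.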
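For a single (state, action) sample the clipped objective is min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A), where A is the advantage. This is the standard formulation from the PPO paper rather than something spelled out in the text above, and the function names below are illustrative.

```python
def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective (to be maximized)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the min makes the objective a pessimistic lower bound.
    return min(unclipped, clipped)

# Good action (A > 0): no extra credit once the ratio exceeds 1 + epsilon.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# Bad action (A < 0): no extra credit once the ratio drops below 1 - epsilon.
capped_loss = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
# Bad action made MORE likely: the full penalty is kept, not clipped away.
full_penalty = ppo_clipped_objective(ratio=1.5, advantage=-1.0)
```

The asymmetry in the last case matters: clipping only removes the reward for moving further in the favorable direction, so mistakes that moved the policy the wrong way are still fully penalized.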
PPO became the industry standard for several reasons: 1. simple to implement - just a few lines different from a regular policy gradient 2. stable - clipping prevents catastrophic updates 3. data-efficient - multiple epochs of optimization on a single data batch are possible 4. general-purpose - works with both discrete and continuous actions. PPO is precisely what was used in **RLHF for ChatGPT**: after pre-training on text, PPO fine-tunes the model to generate responses preferred by humans.
Why does PPO clip the probability ratio r(theta) = pi_new(a|s) / pi_old(a|s)?
Actor-Critic Architectures
Actor-Critic is not a single algorithm but a whole family. All members share the idea: the Actor selects actions, the Critic evaluates them. But implementation details vary widely: how the network is structured, how the Critic is trained, and which tricks are used for stability. Let's go through the main variants and their application domains.
**A3C (Asynchronous Advantage Actor-Critic)** - a historical milestone (DeepMind, 2016). Instead of one agent, N parallel agents run, each in its own copy of the environment with its own copy of the model. Agents asynchronously update a shared model. This predated PPO and solved the correlated-data problem: different agents encounter different situations. However, A3C was overtaken by A2C (the synchronous version) and PPO, because asynchronous updates introduce noise and complicate reproducibility.
**SAC (Soft Actor-Critic) - for continuous actions:**
Standard Actor-Critic maximizes reward. SAC maximizes reward + entropy (randomness) of the policy.
Why entropy?
- Encourages **exploration**
- The policy does not collapse to a single action
- More robust to environment perturbations
SAC features:
- Two Critics (Q1, Q2) - the minimum is taken for stability
- Actor outputs distribution parameters (mean, std)
- Actions are sampled from a Gaussian
- Automatic entropy coefficient tuning
Where it is used:
- Robotics: controlling manipulators, locomotion
- Continuous control: motors, joints
- Tasks where exploration is critical
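Two of the SAC features listed above, the Gaussian actor and the twin-critic minimum, can be sketched together with the entropy term. The code assumes a one-dimensional action and a fixed entropy coefficient `alpha`, and it omits details of full SAC such as tanh squashing of actions and automatic tuning of `alpha`. All function names are illustrative.

```python
import math
import random

def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy at the given action."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def sample_action(mean, std, rng=random):
    """Sample an action from the Gaussian and return it with its log-prob."""
    action = rng.gauss(mean, std)
    return action, gaussian_log_prob(action, mean, std)

def soft_q_target(reward, q1_next, q2_next, log_prob_next,
                  alpha=0.2, gamma=0.99, done=False):
    """Critic target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    if done:  # assumption: no future value after a terminal state
        return reward
    # The min of the two critics counters overestimation;
    # -alpha * log pi is the entropy bonus for staying random.
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * soft_value

action, log_prob = sample_action(mean=0.0, std=1.0, rng=random.Random(0))
target = soft_q_target(1.0, q1_next=2.0, q2_next=3.0,
                       log_prob_next=-1.0, alpha=0.2, gamma=0.9)
```

An unlikely action has a very negative log-probability, so subtracting alpha * log pi raises the target for it, which is how the entropy term rewards exploration.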
Choosing an algorithm depends on the task. **Discrete actions** (Atari, board games, choosing from a fixed set): PPO is the best default. **Continuous actions** (robotics, motor control, physical simulations): SAC or TD3. **Language model training** (RLHF): PPO, but with modifications (KL penalty relative to the base model). General rule: start with PPO, and switch to SAC or TD3 if PPO needs too many samples on a continuous-control task.
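One common form of the KL penalty mentioned for RLHF subtracts a scaled log-probability ratio between the fine-tuned policy and the frozen base model from the reward model's score. The sketch below is an assumption-laden illustration: the function name, the argument names, and the value of `beta` are hypothetical, and real systems typically apply the penalty per generated token.

```python
def kl_penalized_reward(reward_model_score, log_prob_policy,
                        log_prob_reference, beta=0.1):
    """Reward used by PPO: score - beta * (log pi - log pi_ref)."""
    # The log-ratio is a single-sample estimate of the KL divergence
    # between the fine-tuned policy and the base (reference) model.
    kl_estimate = log_prob_policy - log_prob_reference
    return reward_model_score - beta * kl_estimate

# The policy is more confident than the base model about this response,
# so part of the reward is taken away to keep it close to the base model.
shaped = kl_penalized_reward(1.0, log_prob_policy=-1.0,
                             log_prob_reference=-1.5, beta=0.1)
```

The penalty discourages the policy from drifting into outputs that score well on the reward model but that the base model would consider very unlikely.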
All these algorithms are descendants of one idea: **directly optimize the policy**. REINFORCE showed it was possible. Actor-Critic reduced variance. PPO made training stable. SAC and TD3 adapted the approach for continuous actions. And RLHF applied it to language models, turning GPT from a text generator into a useful assistant. Policy gradient is one of the most influential frameworks in modern AI.
Policy gradient is always better than Q-learning because it directly optimizes what we want
Q-learning is more sample-efficient for tasks with discrete actions, while policy gradient is necessary for continuous actions and tasks requiring a stochastic policy
DQN can learn from fewer samples because it uses a replay buffer and off-policy learning, so each experience is used multiple times. Policy gradient methods such as REINFORCE, A2C, and PPO are on-policy: data is used for one round of updates and then discarded (SAC and TD3 are off-policy exceptions). However, Q-learning does not scale well to continuous action spaces (an argmax over all actions is required), and for such tasks policy gradient methods are usually the practical option. The right choice depends on the properties of the task, not on the inherent superiority of one approach.
For controlling a robot arm with 7 continuous degrees of freedom (joint angles), which algorithm is the best choice?
Key Takeaways
- **REINFORCE:** directly optimizes the policy via the log-probability trick - grad J = E[grad log pi(a|s) * G_t], but suffers from high variance due to Monte Carlo return estimation
- **Actor-Critic (A2C):** the Actor selects actions, the Critic evaluates them via V(s), advantage A = r + gamma*V(s') - V(s) shows how much the action is better than average, significantly reducing variance
- **PPO:** clipped objective constrains policy changes within [1-eps, 1+eps], allowing multiple epochs of optimization on a single data batch - the industry standard for RLHF and most RL tasks
- **Actor-Critic family:** SAC for continuous actions (robotics), TD3 with twin critics against overestimation, PPO for discrete tasks and LLMs - from REINFORCE to RLHF in ChatGPT, policy gradient has traveled from a theoretical idea to the technology behind the most impressive AI systems
Related Topics
Policy gradient combines neural networks, optimization, and decision theory, connecting reinforcement learning with modern LLMs:
- Q-Learning — Value-based alternative: Q-learning trains a value function for actions, policy gradient trains the policy directly. Q-learning is more data-efficient for discrete tasks, policy gradient is necessary for continuous actions
- Introduction to RL — Foundation: MDP, reward, policy, value function - the basic concepts on which policy gradient is built. Understanding exploration vs exploitation helps explain why SAC maximizes entropy
- Neural Networks — Policy gradient parameterizes the policy with a neural network and trains it via backpropagation. Actor and Critic architectures are regular neural networks with softmax (Actor) or linear (Critic) output heads
- Optimization and Gradient Descent — Policy gradient is gradient ascent in the space of policy parameters. PPO adds step-size constraints, analogous to trust region methods in optimization
Questions for Reflection
- REINFORCE uses the full return G_t (Monte Carlo), while Actor-Critic uses a one-step TD estimate. Between them lies a spectrum: n-step returns. How does the choice of n affect the bias-variance tradeoff, and when is each preferable?
- PPO constrains policy changes through clipping. What problems would arise if you removed clipping and ran multiple epochs of regular policy gradient on the same data batch?
- RLHF uses PPO to fine-tune language models. Why can't you simply fine-tune the model with supervised learning on the responses that humans rated highly? What does PPO add that supervised learning lacks?
Related Lessons
- ml-49-q-learning — Policy methods contrast with value methods
- ml-48-rl-intro — Builds on MDP and reward foundations
- ml-25-neural-networks — Policy is a neural network output
- ml-09-gradient-descent — Gradient ascent updates the policy
- calc-19-gradient — The policy gradient theorem uses gradients
- aie-47-autonomous-agents