Machine Learning

Policy Gradient

ChatGPT was not trained only on text - the final critical step was called RLHF (Reinforcement Learning from Human Feedback), where a policy gradient algorithm (PPO) learned to generate responses that humans prefer. The same family of algorithms teaches robots to walk, drones to fly, and autonomous vehicles to drive.

**RLHF for language models** - PPO fine-tunes ChatGPT, Claude, and other LLMs to generate responses aligned with human preferences: safe, helpful, and honest, turning a raw language model into a useful assistant
**Robotics** - SAC and TD3 train robot arms to grasp objects and robotic dogs to walk on uneven surfaces, and are being explored for the precise movements required of surgical robots
**Games and strategy** - PPO trained OpenAI Five to beat the reigning world champions at Dota 2, and DeepMind's AlphaStar used a related actor-critic policy gradient method to reach Grandmaster level at StarCraft II, showing that these methods scale to complex strategy games with incomplete information

Prerequisites

Q-Learning and Deep Q-Network

From REINFORCE to PPO: how policy gradient became the engine of RLHF

In 1992 Ronald J. Williams published the REINFORCE algorithm, giving reinforcement learning a way to optimize a parameterized policy directly through the log-probability trick instead of learning value tables. The idea was elegant but suffered from high variance, so the field added a critic: actor-critic methods estimated a baseline value function to cut the noise in the gradient. The next leap came in 2015, when John Schulman and colleagues introduced TRPO (Trust Region Policy Optimization), which guaranteed stable, near-monotonic improvement by constraining each update with a KL-divergence trust region. TRPO worked but was heavy to implement. In 2017 Schulman and his team at OpenAI published PPO (Proximal Policy Optimization), replacing the hard trust region with a simple clipped objective. PPO was almost as stable and far easier to code, and it became the default policy gradient method across robotics, game-playing agents, and eventually RLHF, the step that turned raw language models into aligned assistants like ChatGPT.

REINFORCE: Direct Policy Optimization

In Q-learning we learned the value function Q(s, a) and derived the policy from it (choose the action with the highest Q). **Policy gradient** takes the opposite approach: we **directly parameterize the policy** pi(a|s; theta) - a neural network that takes a state and outputs action probabilities. The goal is to find parameters theta such that the policy achieves the maximum total reward. It's like the difference between rating every restaurant by score (Q-learning) and directly learning to choose a restaurant you'll enjoy (policy gradient).

The key idea behind REINFORCE is the **log-probability trick** (also known as the score function estimator). We cannot differentiate the reward itself with respect to the network parameters, because the reward comes from the environment, not from our model. But we can write the gradient of the expected reward through the log-probability of the chosen actions: **grad J(theta) = E[sum_t grad log pi(a_t|s_t; theta) * G_t]**, where G_t is the return that followed action a_t. Intuition: if the return after an action was large, increase the probability of that action in proportion. If the return was negative (or, once a baseline is subtracted, below average), decrease it.
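The identity behind the log-probability trick can be checked numerically. The sketch below assumes a one-step problem (a three-armed bandit) with a softmax policy over logits; the function names and the reward values are illustrative and not part of the lesson. It computes the gradient of the expected reward twice: once as E[grad log pi(a) * R(a)] and once by finite differences on J itself.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def expected_reward(logits, rewards):
    # J(theta) = sum_a pi(a; theta) * R(a)
    return float(softmax(logits) @ rewards)

def score_function_gradient(logits, rewards):
    # grad J = sum_a pi(a) * grad log pi(a) * R(a), an exact expectation
    # (no sampling), so the result is deterministic.
    pi = softmax(logits)
    grad = np.zeros_like(logits)
    for a in range(len(logits)):
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0  # d log pi(a) / d logits = onehot(a) - pi
        grad += pi[a] * grad_log_pi * rewards[a]
    return grad

def finite_difference_gradient(logits, rewards, h=1e-6):
    # Central differences on J(theta) as an independent reference.
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        step = np.zeros_like(logits)
        step[i] = h
        grad[i] = (expected_reward(logits + step, rewards)
                   - expected_reward(logits - step, rewards)) / (2 * h)
    return grad

logits = np.array([0.5, -0.2, 0.1])   # illustrative policy parameters
rewards = np.array([1.0, 0.0, 2.0])   # illustrative reward per action
g_score = score_function_gradient(logits, rewards)
g_fd = finite_difference_gradient(logits, rewards)
print(g_score, g_fd)
```

The two gradients agree up to finite-difference error, even though the reward function itself is never differentiated. In a real environment the expectation is replaced by an average over sampled actions, which is where the variance comes from.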

**Policy Gradient Theorem (simplified):**

grad J(theta) = E[ sum_t grad log pi(a_t | s_t; theta) * G_t ]

Where:
- J(theta) - expected total reward (what we maximize)
- pi(a_t | s_t; theta) - probability of action a_t in state s_t
- G_t - return (total reward from step t to the end of the episode)
- grad log pi - direction in which the probability of the action increases

**REINFORCE algorithm:**
1. Play a full episode, recording (s_t, a_t, r_t)
2. For each step t compute return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
3. Update theta: theta += alpha * grad log pi(a_t|s_t) * G_t
4. Repeat
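The REINFORCE steps can be sketched for a small tabular problem. This is a minimal illustration, not a reference implementation: it assumes a softmax policy stored as a table of logits `theta[state, action]`, and the names `discounted_returns` and `reinforce_update` are hypothetical helpers introduced here.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def discounted_returns(rewards, gamma=0.99):
    # Step 2: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    # computed backwards so each G_t reuses G_{t+1}.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, episode, alpha=0.1, gamma=0.99):
    # theta: (n_states, n_actions) table of logits (assumed policy form).
    # episode: list of (state, action, reward) tuples from one full episode.
    returns = discounted_returns([r for _, _, r in episode], gamma)
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(episode):
        pi = softmax(theta[s])
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0          # grad of log pi(a|s) w.r.t. theta[s]
        grad[s] += grad_log_pi * returns[t]
    # Step 3: gradient ascent on the expected return.
    return theta + alpha * grad

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
theta = np.zeros((2, 2))
theta = reinforce_update(theta, [(0, 1, 1.0)], alpha=0.1)
print(example_returns, theta)
```

Here the gradient is accumulated over the whole episode and applied once, which is one common way to read step 3; applying it step by step is an equally valid variant.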

The main problem with REINFORCE is **high variance**. Because it uses a Monte Carlo estimate over a full episode, G_t can vary widely from episode to episode: one episode may yield G = 200, another G = 50, and the gradients jump around chaotically. A simple way to reduce variance is **baseline subtraction**: subtract the mean return so that better-than-average episodes receive a positive signal and worse-than-average ones a negative signal. Normalizing returns (subtracting their mean and dividing by their standard deviation) is a common form of baseline subtraction.
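A minimal sketch of baseline subtraction, assuming NumPy and a batch of episode returns. The helper name `normalize_returns` and the sample values are illustrative.

```python
import numpy as np

def normalize_returns(returns, eps=1e-8):
    # Subtract the batch mean (the baseline) and rescale by the standard
    # deviation; eps guards against division by zero for constant returns.
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)

returns = np.array([200.0, 50.0, 120.0, 30.0])  # illustrative episode returns
centered = returns - returns.mean()             # baseline subtraction only
normalized = normalize_returns(returns)         # baseline + rescaling
print(centered)
print(normalized)
```

With raw returns every episode would push its actions up, only by different amounts. After centering, episodes above the mean push their actions up and episodes below the mean push them down.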

Why does REINFORCE update parameters in the direction of grad log pi(a|s) * G rather than simply grad pi(a|s) * G?

Advantage Actor-Critic (A2C)

REINFORCE waits until the end of an episode to compute the return G_t. This creates two problems: high variance (one long, unusual episode can distort the whole update), and an inability to learn in tasks without a clear end (continuing tasks). **Actor-Critic** addresses both by splitting the model into two parts: the **Actor** (policy pi) selects actions, and the **Critic** (value function V(s)) estimates how good the current state is.

The key concept is the **Advantage** A(s, a) = Q(s, a) - V(s). Advantage shows how much a specific action is **better than average** in a given state. If the advantage is positive - the action is better than usual, increase its probability. If negative - worse than average, decrease it. In practice, advantage is estimated via the TD error: **A(s, a) = r + gamma * V(s') - V(s)**, where V(s) is the Critic's output.
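The one-step TD estimate of the advantage can be sketched as follows. The function name `td_advantages` is illustrative, and the `dones` mask (which zeroes the bootstrap term at terminal states) is an assumption added here for episodic tasks; it is not part of the formula above.

```python
import numpy as np

def td_advantages(rewards, values, next_values, dones, gamma=0.99):
    # A(s, a) ~= r + gamma * V(s') - V(s)
    # values / next_values are the Critic's estimates V(s) and V(s').
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # At a terminal transition there is no next state to bootstrap from.
    return rewards + gamma * next_values * (1.0 - dones) - values

adv = td_advantages(
    rewards=[1.0, 0.0],
    values=[0.5, 1.0],
    next_values=[1.0, 0.0],
    dones=[0, 1],
    gamma=0.9,
)
print(adv)
```

The first transition turned out better than the Critic expected (positive advantage), the second worse (negative advantage), so the Actor would raise the probability of the first action and lower that of the second.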

**Why Advantage reduces variance:**

REINFORCE: grad J = E[grad log pi(a|s) * G_t]
- G_t includes ALL future rewards (sum from t to end)
- G_t is highly noisy (lots of randomness)

Actor-Critic: grad J = E[grad log pi(a|s) * A(s, a)]
- A(s, a) = r + gamma * V(s') - V(s)
- Uses only the ONE-step reward plus the Critic's estimate
- V(s) is a trained neural network, so the estimate is smooth
- Variance is significantly lower

**Trade-off (bias-variance tradeoff):**
- REINFORCE: unbiased, high variance
- Actor-Critic: biased (the Critic can make errors), low variance
- In practice, low variance often matters more than unbiasedness

A2C (Advantage Actor-Critic) is the **synchronous** version of the algorithm: several parallel environments collect data simultaneously, then all gradients are averaged and applied in a single step. This is more efficient than a single agent in a single environment because parallel environments provide more diverse experience and stabilize training. A2C is the workhorse of policy gradient: straightforward, easy to understand, and effective enough for many tasks.

Advantage A(s, a) = Q(s, a) - V(s). What does a negative advantage for a specific action mean?

PPO - Proximal Policy Optimization

REINFORCE and A2C make **one update step** per batch of data and then discard it. This is wasteful - data took a long time to collect but is used only once. We'd like to make multiple optimization steps on the same batch. But if we update the policy too aggressively, the new policy will be far from the one that collected the data, making updates incorrect. **PPO (Proximal Policy Optimization)** solves this elegantly: it constrains the update size through a **clipped objective**.

**The problem PPO solves:**

In policy gradient, data is collected by the old policy pi_old. We want to update to a new policy pi_new. If pi_new is too far from pi_old:
- The data is no longer representative (it was collected by a different policy)
- The update can catastrophically degrade the model
- Training becomes unstable

**TRPO (Trust Region Policy Optimization)** - the predecessor:
- Added the constraint KL(pi_old || pi_new) < delta
- Guaranteed monotonic improvement in theory
- But complex to implement (conjugate gradient, line search)

**PPO** - a simplification of TRPO:
- Instead of a hard constraint, clipping in the loss function
- Nearly as stable, but much simpler
- Published by OpenAI in 2017 and a de facto industry standard ever since

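The clipped objective idea: compute the probability ratio **r(theta) = pi_new(a|s) / pi_old(a|s)**. If r = 1, the policy has not changed. If r = 2, the new policy makes this action twice as likely. PPO clips this ratio to the range [1 - epsilon, 1 + epsilon], typically with epsilon = 0.2, and maximizes **L(theta) = E[min(r(theta) * A, clip(r(theta), 1 - epsilon, 1 + epsilon) * A)]**, where A is the advantage. Once the ratio has moved more than 20% away from 1 in the direction that improves the objective, the gradient for that sample vanishes, so the update has no incentive to push the action's probability any further. This is a soft limit rather than a hard guarantee, but in practice it keeps the new policy close to the old one.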
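A minimal NumPy sketch of the clipped surrogate objective, computed from log-probabilities as is common in practice. The function names and the sample numbers are illustrative; a real implementation would compute this inside an autodiff framework and maximize it (or minimize its negative).

```python
import numpy as np

def ppo_clip_terms(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # r(theta) = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the minimum makes the objective a pessimistic bound.
    return np.minimum(unclipped, clipped)

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    return float(np.mean(ppo_clip_terms(new_log_probs, old_log_probs,
                                        advantages, epsilon)))

# Three samples with ratios 2.0, 0.5, 2.0 and advantages +1, -1, -1.
terms = ppo_clip_terms(
    new_log_probs=np.log([2.0, 0.5, 2.0]),
    old_log_probs=np.zeros(3),
    advantages=[1.0, -1.0, -1.0],
)
print(terms)
```

In the first two samples the policy already moved far in the "good" direction, so the term is capped at 1.2 and -0.8 and contributes no further gradient. In the third sample the policy moved in the "bad" direction (more probability on an action with negative advantage), so the term stays unclipped at -2.0 and the update is free to correct it.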

PPO became the industry standard for several reasons: 1. simple to implement - just a few lines different from a regular policy gradient 2. stable - clipping prevents catastrophic updates 3. data-efficient - multiple epochs of optimization on a single data batch are possible 4. general-purpose - works with both discrete and continuous actions. PPO is precisely what was used in **RLHF for ChatGPT**: after pre-training on text, PPO fine-tunes the model to generate responses preferred by humans.

Why does PPO clip the probability ratio r(theta) = pi_new(a|s) / pi_old(a|s)?

Actor-Critic Architectures

Actor-Critic is not a single algorithm but a whole family. All members share the idea: the Actor selects actions, the Critic evaluates them. But implementation details vary widely: how the network is structured, how the Critic is trained, and which tricks are used for stability. Let's go through the main variants and their application domains.

**A3C (Asynchronous Advantage Actor-Critic)** - a historical milestone (DeepMind, 2016). Instead of one agent, N parallel agents run, each in its own copy of the environment with its own copy of the model. Agents asynchronously update a shared model. This predated PPO and solved the correlated-data problem: different agents encounter different situations. However, A3C was overtaken by A2C (the synchronous version) and PPO, because asynchronous updates introduce noise and complicate reproducibility.

**SAC (Soft Actor-Critic) - for continuous actions:**

Standard Actor-Critic maximizes reward. SAC maximizes reward + entropy (randomness) of the policy.

Why entropy?
- Encourages **exploration**
- The policy does not collapse to a single action
- More robust to environment perturbations

SAC features:
- Two Critics (Q1, Q2) - the minimum is taken for stability
- Actor outputs distribution parameters (mean, std)
- Actions are sampled from a Gaussian
- Automatic entropy coefficient tuning

Where it is used:
- Robotics: controlling manipulators, locomotion
- Continuous control: motors, joints
- Tasks where exploration is critical
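How the two Critics and the entropy bonus combine can be sketched through the target that SAC's Critics are trained toward. This is a simplified illustration: the function name is hypothetical, the entropy coefficient `alpha` is fixed rather than tuned automatically, and the Q-values and log-probability are passed in as plain numbers instead of coming from networks.

```python
import numpy as np

def sac_critic_target(reward, done, q1_next, q2_next, next_log_prob,
                      gamma=0.99, alpha=0.2):
    # Take the minimum of the two Critics to counter overestimation.
    min_q = np.minimum(q1_next, q2_next)
    # Entropy bonus: -alpha * log pi(a'|s') is large when the policy
    # assigns low probability to a', i.e. when it is still exploring.
    soft_value = min_q - alpha * next_log_prob
    # No bootstrapping from terminal states.
    return reward + gamma * (1.0 - done) * soft_value

target = sac_critic_target(
    reward=1.0, done=0.0,
    q1_next=2.0, q2_next=3.0,
    next_log_prob=-1.0,
    gamma=0.9, alpha=0.2,
)
print(target)
```

Setting `alpha` to zero recovers an ordinary twin-critic target, which makes it easy to see that the entropy term is the only difference between maximizing reward and maximizing reward plus entropy.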

Choosing an algorithm depends on the task. **Discrete actions** (Atari, board games, choosing from a fixed set): PPO is the best default. **Continuous actions** (robotics, motor control, physical simulations): SAC or TD3. **Language model training** (RLHF): PPO, but with modifications (KL penalty relative to the base model). General rule: start with PPO, and switch to SAC/TD3 if PPO turns out to be too sample-inefficient or unstable on a continuous-control task.
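The KL penalty used in RLHF can be sketched as a correction to the reward. The version below is one common formulation, assumed here for illustration: the reward-model score for a response is reduced by `beta` times the summed per-token log-ratio between the fine-tuned policy and the frozen base model. The function name, `beta` value, and sample numbers are hypothetical.

```python
import numpy as np

def kl_shaped_reward(reward, policy_log_probs, ref_log_probs, beta=0.1):
    # Sum over tokens of log pi_policy(token) - log pi_ref(token):
    # a sample-based estimate of how far the policy drifted from the
    # base model on this particular response.
    log_ratio = np.sum(np.asarray(policy_log_probs, dtype=float)
                       - np.asarray(ref_log_probs, dtype=float))
    return float(reward - beta * log_ratio)

shaped = kl_shaped_reward(
    reward=1.0,                       # score from the reward model
    policy_log_probs=[-1.0, -0.5],    # per-token log-probs, tuned policy
    ref_log_probs=[-2.0, -0.5],       # per-token log-probs, base model
    beta=0.1,
)
print(shaped)
```

The penalty discourages the policy from drifting into text that the reward model scores highly but the base model considers very unlikely, which is a typical symptom of reward hacking.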

All these algorithms are descendants of one idea: **directly optimize the policy**. REINFORCE showed it was possible. Actor-Critic reduced variance. PPO made training stable. SAC and TD3 adapted the approach for continuous actions. And RLHF applied it to language models, turning GPT from a text generator into a useful assistant. Policy gradient is one of the most influential frameworks in modern AI.

Misconception: policy gradient is always better than Q-learning because it directly optimizes what we want

Reality: Q-learning is more sample-efficient for tasks with discrete actions, while policy gradient is necessary for continuous actions and tasks requiring a stochastic policy

DQN can learn from fewer samples because it uses a replay buffer and off-policy learning, so each experience is reused many times. Policy gradient methods are typically on-policy (off-policy actor-critic methods such as SAC and TD3 are the exception): data is used for one update, or for a few epochs in PPO, and then discarded. However, Q-learning does not scale well to continuous action spaces, because it requires an argmax over all actions, and for such tasks policy gradient and actor-critic methods are the practical option. The right choice depends on the properties of the task, not on the inherent superiority of one approach.

For controlling a robot arm with 7 continuous degrees of freedom (joint angles), which algorithm is the best choice?

Key Takeaways

**REINFORCE:** directly optimizes the policy via the log-probability trick - grad J = E[grad log pi(a|s) * G_t], but suffers from high variance due to Monte Carlo return estimation
**Actor-Critic (A2C):** the Actor selects actions, the Critic evaluates them via V(s), advantage A = r + gamma*V(s') - V(s) shows how much the action is better than average, significantly reducing variance
**PPO:** clipped objective constrains policy changes within [1-eps, 1+eps], allowing multiple epochs of optimization on a single data batch - the industry standard for RLHF and most RL tasks
**Actor-Critic family:** SAC for continuous actions (robotics), TD3 with twin critics against overestimation, PPO for discrete tasks and LLMs - from REINFORCE to RLHF in ChatGPT, policy gradient has traveled from a theoretical idea to the technology behind the most impressive AI systems

Questions for Reflection

REINFORCE uses the full return G_t (Monte Carlo), while Actor-Critic uses a one-step TD estimate. Between them lies a spectrum: n-step returns. How does the choice of n affect the bias-variance tradeoff, and when is each preferable?
PPO constrains policy changes through clipping. What problems would arise if you removed clipping and ran multiple epochs of regular policy gradient on the same data batch?
RLHF uses PPO to fine-tune language models. Why can't you simply fine-tune the model with supervised learning on the responses that humans rated highly? What does PPO add that supervised learning lacks?

Related Lessons

ml-49-q-learning — Policy methods contrast with value methods
ml-48-rl-intro — Builds on MDP and reward foundations
ml-25-neural-networks — Policy is a neural network output
ml-09-gradient-descent — Gradient ascent updates the policy
calc-19-gradient — The policy gradient theorem uses gradients
aie-47-autonomous-agents
