Reinforcement Learning

RL for Robotics

Prerequisites

RL for games: the sim2real idea and MCTS planning transfer directly to robots
Model-based RL with a physics simulator as the world model
Imitation learning and demonstrations, used to bootstrap manipulation
Reward shaping and the survival-bonus reasoning behind locomotion rewards

Crossing the sim-to-real gap

Training robots directly in the physical world is slow and damages hardware, so the field bet on simulation, which created the sim-to-real gap: a policy that is optimal in a simulator breaks on a real robot. In 2017 Josh Tobin and coauthors at OpenAI introduced domain randomization, deliberately randomizing physics and visual parameters in simulation so the real world looks like just another sample from the training distribution. Around the same period Sergey Levine and collaborators developed guided policy search and large-scale learning of manipulation skills. The headline demonstration came in 2018-2019, when OpenAI's Dactyl project trained a Shadow Hand entirely in simulation to perform dexterous in-hand manipulation and solve a Rubik's cube, transferring to the physical hand without real-world RL training.

ETH Zurich releases ANYmal onto a mountain trail. Every 50 ms the policy makes a decision about 12 joints. Rocks, ice, slopes - an environment the simulator only ever represented as randomized parameters. The robot does not fall. This is sim2real training in action.

**Boston Dynamics Spot** - a hybrid of RL and classical control: RL for adaptive locomotion, MPC for planned movements
**Google RT-2** - language-conditioned manipulation: the robot understands 'pick up something edible' without specific training on that query
**Tesla Optimus** - humanoid with RL for manipulation: factory assembly where the sim2real gap is covered by large volumes of demonstration data

Sim2Real: From Simulation to the Real World

The core problem with RL for robots: direct real-world training is impractical. Millions of training steps mean millions of physical contacts - broken hardware and danger. Simulation enables cheap, parallel, risk-free training.

The sim2real gap is the difference between simulation and reality. Friction, elasticity, actuator delays - all approximated in the simulator. A policy optimal in simulation breaks on first contact with a real robot. Domain Randomization is the technique to overcome this: train with randomized physics parameters.
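A minimal sketch of the idea, assuming three randomized parameters (friction, contact elasticity, actuator delay). The parameter names and ranges below are hypothetical and chosen only for illustration; a real setup would pass each sample to its simulator before every episode.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    friction: float          # ground friction coefficient
    restitution: float       # elasticity of contacts (0 = no bounce)
    actuator_delay_s: float  # delay between command and torque, in seconds


# Hypothetical ranges, not taken from any published setup.
RANGES = {
    "friction": (0.4, 1.2),
    "restitution": (0.0, 0.5),
    "actuator_delay_s": (0.0, 0.04),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one set of physics parameters uniformly from RANGES."""
    return PhysicsParams(
        **{name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    )


rng = random.Random(0)
# One fresh draw per training episode, so the policy cannot overfit to one physics setting.
episodes = [sample_physics(rng) for _ in range(1000)]
```

If the real robot's friction and delay fall inside these ranges, the real world is one more sample from a distribution the policy has already handled.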

OpenAI Dactyl (2019) solved the Rubik's Cube one-handed, purely via sim2real with domain randomization: about 13,000 years of simulated experience and 0 hours of RL training on real hardware. Then 4 hours on the real hand - task solved. This was the first robot dexterous manipulation result at that level.

Why does Domain Randomization randomly change physical parameters during training?

Locomotion: RL Teaches Robots to Walk

Walking is computationally straightforward for classical control. Adapting to rough terrain, recovering from a shove, and climbing stairs are not, and this is precisely where RL shows qualitatively different results.

ETH Zurich ANYmal (2022) - landmark result: a quadruped trained in simulation with domain randomization, then deployed on a real robot dog. Speed of 1.5 m/s over rough terrain, 73% more stable than classical MPC under strong external pushes. The RL policy issues a new action every 50 ms.

Boston Dynamics Spot does not use pure RL - it is a hybrid system: motion planning + MPC + some RL components for adaptation. DeepMind Robotics and ETH use RL more aggressively. Parkour moves (jumps, rolls) are RL territory; standard walking is MPC territory.
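The survival bonus listed in the prerequisites can be illustrated with a small sketch. It assumes a hypothetical per-step reward made of speed tracking, a torque penalty, and a constant bonus for staying upright; the function name, weights, and numbers are illustrative and not taken from ANYmal or Spot.

```python
def locomotion_reward(forward_velocity, target_velocity, joint_torques, fallen,
                      survival_bonus=1.0, torque_weight=0.001):
    """Per-step reward: track a target speed, penalize effort, reward staying upright."""
    if fallen:
        return 0.0  # the episode ends here, so all later rewards are lost
    tracking = -abs(forward_velocity - target_velocity)
    effort = -torque_weight * sum(t * t for t in joint_torques)
    return survival_bonus + tracking + effort


STEPS_PER_EPISODE = 200  # 10 s of walking at one decision every 50 ms
# A robot that walks too slowly but stays upright, with 12 joints at zero torque.
slow_step = dict(forward_velocity=0.2, target_velocity=0.5,
                 joint_torques=[0.0] * 12, fallen=False)

return_with_bonus = STEPS_PER_EPISODE * locomotion_reward(**slow_step)
return_without_bonus = STEPS_PER_EPISODE * locomotion_reward(**slow_step, survival_bonus=0.0)
return_if_fallen_at_once = locomotion_reward(0.0, 0.5, [0.0] * 12, fallen=True)
```

Without the bonus, every step of imperfect walking is negative, so falling immediately (return 0) scores better than trying. With the bonus, staying upright is worth more than ending the episode early, which gives the agent time to learn to walk well.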

Why is a survival bonus important in the reward function for locomotion?

Manipulation: Grasping and Working with Objects

Manipulation is harder than locomotion. Contact-rich tasks: tighten a bolt, assemble an IKEA chair, pour a glass of water. Every contact is a high-dimensional interaction. Classical planning requires precise models of every object. RL learns from experience.

Google RT-2 (2023) - a turning point: a vision-language-action model. The model trains on robot data plus internet data. It receives a natural language instruction ('pick up the brown thing you cannot eat') and executes it. Emergent zero-shot generalization through scale.

Imitation Learning is critical for manipulation: collecting demonstrations from a human teleoperator is much faster than waiting for RL to discover a valid grasp on its own. Behavior Cloning + RL fine-tuning is the standard pipeline. The ALOHA robot from Stanford is built around exactly this kind of teleoperated demonstration data.
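The behavior cloning step can be sketched as plain supervised regression. The example below assumes a toy 1-D gripper and a hypothetical operator who commands a velocity proportional to the distance to the object; the data is synthetic and the linear policy is a simplification, not how ALOHA or any real system is implemented.

```python
import random

rng = random.Random(0)

# Synthetic "teleoperation" log: observed distance and the operator's velocity command.
EXPERT_GAIN = 0.8  # hypothetical operator behavior, unknown to the learner
distances = [rng.uniform(-1.0, 1.0) for _ in range(500)]
commands = [EXPERT_GAIN * d + rng.gauss(0.0, 0.01) for d in distances]

# Behavior cloning: fit the policy a = k * d to the demonstrations.
# For a linear policy, least squares has a closed-form solution.
cloned_gain = (sum(d * a for d, a in zip(distances, commands))
               / sum(d * d for d in distances))


def cloned_policy(distance: float) -> float:
    """Imitates the demonstrator on states similar to those in the data."""
    return cloned_gain * distance
```

The cloned policy is only a starting point: it has no notion of reward and can drift on states the operator never visited, which is why RL fine-tuning follows.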

Why is manipulation harder than locomotion for RL?

Safety: RL and Real-World System Safety

In game RL the cost of a mistake is restarting an episode. In robot RL the cost is damaged hardware, human injury, legal liability. Safety in RL is a separate research area with concrete techniques that stand between the agent and catastrophe.

Constrained MDP: add explicit constraints to the problem. The agent maximizes reward while keeping costs such as joint torque, approach velocity to obstacles, and collision probability under fixed limits. Lagrangian relaxation turns each constraint into a penalty whose weight grows adaptively while the constraint is breached.
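A minimal sketch of the adaptive penalty, assuming a single constraint with a per-episode cost and a fixed limit. The function names, step size, and cost values are hypothetical; the policy optimizer itself is omitted.

```python
def update_multiplier(lam, episode_cost, cost_limit, step_size=0.05):
    """Dual ascent: raise the penalty weight when the limit is exceeded, lower it otherwise."""
    return max(0.0, lam + step_size * (episode_cost - cost_limit))


def penalized_reward(reward, cost, lam):
    """Reward seen by the policy optimizer: task reward minus the weighted constraint cost."""
    return reward - lam * cost


lam = 0.0
cost_limit = 1.0  # hypothetical budget, e.g. allowed torque-limit violations per episode
history = []
# Hypothetical per-episode costs: violations at first, then the policy starts to comply.
for episode_cost in [5.0, 4.0, 3.0, 0.5, 0.5]:
    lam = update_multiplier(lam, episode_cost, cost_limit)
    history.append(lam)
```

The multiplier rises while the cost exceeds the limit and relaxes once the policy complies, so the penalty does not have to be hand-tuned in advance. Clamping at zero keeps a satisfied constraint from turning into a reward.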

Control Barrier Functions (CBF) are an elegant approach: safety that is mathematically proven, provided the dynamics model used in the proof is accurate. A CBF defines a safe set of states and filters RL agent actions in real time: if an action would take the system outside the safe set, it is replaced with the nearest safe action. RL learns within the CBF boundary.
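A minimal sketch of such a filter, assuming the simplest possible system: a 1-D position x driven directly by a velocity command (x' = u) that must stay below a wall at x_max. The barrier is h(x) = x_max - x, and the gain alpha and all numbers are hypothetical. Real robots need a dynamics model and usually a small optimization problem instead of this closed form.

```python
def cbf_filter(u_rl, x, x_max, alpha=2.0):
    """Return the action closest to u_rl that keeps h(x) = x_max - x nonnegative.

    With x' = u, the barrier condition h' >= -alpha * h becomes
    -u >= -alpha * (x_max - x), i.e. u <= alpha * (x_max - x).
    """
    u_bound = alpha * (x_max - x)
    return min(u_rl, u_bound)  # unchanged if already safe, otherwise clipped to the bound


# Stand-in for an RL policy that always pushes toward the wall at full speed.
x, x_max, dt = 0.0, 1.0, 0.05  # 50 ms control step
trajectory = []
for _ in range(200):
    u = cbf_filter(u_rl=1.0, x=x, x_max=x_max)
    x += dt * u
    trajectory.append(x)
```

Far from the wall the RL action passes through untouched; near the wall the allowed speed shrinks in proportion to the remaining distance, so the state approaches x_max without crossing it. The guarantee here relies on the assumed dynamics and on alpha * dt being below 1 in this discrete-time loop.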

Misconception: an RL policy that is safe in simulation will be safe on a real robot.

Reality: the sim2real gap for safety-critical behavior requires separate techniques: CBF, conservative RL, and hardware testing protocols.

Simulation only approximates physics. Edge cases - unusual contacts, sensor failures, unexpected objects - are not covered in simulation. Safety-critical systems require formal verification or hardware-enforced limits on top of RL.

What does a Control Barrier Function do in safety-critical RL?

Key Ideas

**Sim2real** via Domain Randomization: train with randomized physics so the real world becomes 'just another variant'
**Locomotion** with RL outperforms classical control for adaptive terrain and recovery: ETH ANYmal is 73% more stable than MPC under pushes
**Manipulation** requires Imitation Learning + RL: demonstration bootstrapping is critical for contact-rich tasks
**Safety** via Constrained MDP and CBF: mathematically guaranteed constraints matter more than soft reward penalties for real systems

Questions for Reflection

The sim2real gap for manipulation is larger than for locomotion. What approaches beyond Domain Randomization can help close this gap?
How do you balance exploration in RL training with the safety requirements of a real robot?
Google's RT-2 uses internet-scale data for manipulation. What does this change about the traditional sim2real approach?

Related Lessons

rl-15 — Techniques from game RL (sim2real idea, MCTS) apply in robot RL
rl-11 — Model-Based RL with a simulator as the world model
rob-12 — RL for robots - a dedicated lesson in the Robotics course
rl-14 — Imitation Learning and demonstrations accelerate robot task learning
rl-17 — Safety in robot RL parallels safety concerns in RLHF alignment
de-01