Robotics
Embodied AI: Robot Learning through Imitation and Interaction
2024. Figure 01 - a humanoid robot from the startup Figure AI, built in partnership with OpenAI. Demo: the natural-language command 'give me something to eat' -> scene analysis -> selects an apple -> hands it to the human -> verbal confirmation. Perception, language, and action are all driven by learned models. Investment raised in one year: USD 675 million. This is not the future - this is 2024.
- Stanford Mobile ALOHA: household tasks from 50-100 demonstrations, 60-80% success rate
- OpenAI Dactyl: Rubik's cube via RL + domain randomization in simulation
- RT-2 (Google DeepMind): zero-shot manipulation via a 55B vision-language model
Imitation Learning: Learning from Demonstrations
Classic RL for a manipulator: reward a correct grasp. But how should the reward be defined? Too light a grip and the object falls. Too tight and it deforms. Every object needs a different force. Writing a reward function that covers all this variety is a major engineering effort. Imitation Learning bypasses this: simply show the robot the right way.
**Behavioral Cloning (BC)**: supervised learning on expert demonstrations. A neural network maps state -> action. Problem: **distributional shift** - when the robot makes an error, it enters a state that is not in the training set and does not know how to act there.
**DAgger (Dataset Aggregation)**: an iterative method. Step 1: train a policy on demonstrations. Step 2: run the policy while an expert labels the correct action for each state it visits. Step 3: add these corrections to the dataset and retrain. Repeat. After N iterations the dataset also covers the states the policy reaches through its own mistakes.
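The three DAgger steps can be sketched on a toy problem. This is a minimal illustration, not a reference implementation: the lane-keeping task, the `expert_action` rule, the periodic push disturbance, and the lookup-table "learner" are all assumptions made up for the example.

```python
from collections import Counter

MOVE = {"left": -1, "straight": 0, "right": 1}

def expert_action(lane):
    """Hypothetical expert: always steer back toward lane 0."""
    if lane > 0:
        return "left"
    if lane < 0:
        return "right"
    return "straight"

def train(dataset):
    """Stand-in for supervised learning: majority action per seen state.
    Unseen states fall back to the most common action overall."""
    per_state = {}
    for state, action in dataset:
        per_state.setdefault(state, Counter())[action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in per_state.items()}
    fallback = Counter(a for _, a in dataset).most_common(1)[0][0]
    return lambda state: table.get(state, fallback)

def rollout(policy, steps=30, push_every=10):
    """Run the policy; a disturbance pushes the robot one lane to the
    right every `push_every` steps. Returns the list of visited lanes."""
    lane, visited = 0, []
    for t in range(steps):
        visited.append(lane)
        lane += MOVE[policy(lane)]
        if (t + 1) % push_every == 0:
            lane = min(lane + 1, 3)
    return visited

def dagger(iterations=2):
    # Step 1: train on clean demonstrations (the expert never leaves lane 0).
    dataset = [(0, "straight")] * 20
    bc_policy = policy = train(dataset)
    for _ in range(iterations):
        visited = rollout(policy)                                     # Step 2: run the learner
        dataset = dataset + [(s, expert_action(s)) for s in visited]  # Step 3: expert relabels
        policy = train(dataset)
    return bc_policy, policy

bc_policy, dagger_policy = dagger()
bc_drift = sum(rollout(bc_policy))          # total lane offset over one episode
dagger_drift = sum(rollout(dagger_policy))
print("BC drift:", bc_drift, "DAgger drift:", dagger_drift)
```

The cloned policy has only ever seen lane 0, so after the first push it keeps driving straight and never recovers. After aggregation the table contains recovery actions for the lanes the learner actually reached. States it never visited (lane 3 here) still fall back to the default action, which is why the loop is repeated.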
**Stanford Mobile ALOHA (2024)**: a household robot trained to wash dishes, cook food, and clean from 50-100 teleoperation demonstrations. Architecture: Action Chunking with Transformers (ACT), which predicts a chunk of K future actions in a single forward pass instead of one action at a time; fewer decision points per episode means fewer chances for errors to compound. Training takes 4 hours on one GPU. Task success rate: 60-80% with 100 demonstrations versus 20-30% for classic RL.
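Action chunking mainly changes the control loop: the policy is queried once per chunk rather than once per step. The sketch below shows only that loop. `chunk_policy` is a hypothetical stand-in for the transformer, and the toy dynamics (the robot reaches exactly the commanded position) are an assumption for illustration.

```python
from collections import deque

def chunk_policy(observation, k):
    """Hypothetical stand-in for an ACT-style network: one call returns
    the next k target positions. A real model would run a transformer here."""
    return [observation + i + 1 for i in range(k)]

def run_episode(policy, steps, k):
    """Execute actions open-loop within each chunk and query the policy
    only when the chunk is exhausted.
    Returns (executed actions, number of forward passes)."""
    observation = 0
    pending = deque()
    executed, forward_passes = [], 0
    for _ in range(steps):
        if not pending:
            pending.extend(policy(observation, k))
            forward_passes += 1
        action = pending.popleft()
        executed.append(action)
        observation = action  # toy dynamics: the commanded position is reached
    return executed, forward_passes

actions_k5, passes_k5 = run_episode(chunk_policy, steps=20, k=5)
actions_k1, passes_k1 = run_episode(chunk_policy, steps=20, k=1)
print("forward passes with K=5:", passes_k5, "with K=1:", passes_k1)
```

With K=5 the same 20-step trajectory needs a quarter of the forward passes. The trade-off is that the robot does not react to new observations inside a chunk; this sketch leaves out any smoothing or blending between overlapping chunks.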
Behavioral Cloning is trained on expert demonstrations. Why does its performance drop sharply when deviating from learned trajectories?
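BC treats its training states as i.i.d. samples from the expert's state distribution. At run time, however, the states the robot visits depend on its own earlier actions. A small action error places the robot in a state that is not in the training data. BC performs poorly there -> larger error -> the state moves further from the training distribution. DAgger addresses this by adding the states reached through such errors, labelled with the expert's corrections, to the dataset.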
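The compounding effect can be stated quantitatively. As a hedged sketch, assume a per-step cost in [0, 1], an episode of T steps, and a learned policy that disagrees with the expert with probability at most ε on its training distribution. These symbols are introduced here for illustration and do not appear elsewhere in the lesson. The standard imitation-learning analysis then gives bounds of the following shape, where J is the expected total cost and π* is the expert:

```latex
J(\hat{\pi}_{\mathrm{BC}}) \;\le\; J(\pi^{*}) + T^{2}\,\epsilon
\qquad \text{versus} \qquad
J(\hat{\pi}_{\mathrm{DAgger}}) \;\le\; J(\pi^{*}) + O(T\,\epsilon)
```

For BC, ε is measured on expert states, and a single mistake can cost up to the rest of the episode, hence the quadratic term. For DAgger, ε is measured on the states the learner itself visits, and the linear bound hides a constant that depends on how cheaply the expert can recover from a mistake.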
Sim-to-Real Transfer: From Simulator to the Real World
Collecting real demonstrations is expensive: one hour of manipulator teleoperation costs USD 20-50 in operator time. Complex tasks need thousands of hours, which adds up to hundreds of thousands of dollars. Simulated experience costs only compute. But a policy trained in simulation often fails in reality because of the **reality gap**: differences in physics, textures, lighting, and contact forces. Sim-to-real transfer is the engineering task of narrowing this gap.
**Domain Randomization**: during training in the simulator, randomly vary the physical and visual parameters - object mass ±30%, friction coefficient ±50%, lighting, textures. The policy is forced to work across the whole randomized range -> it becomes robust to the real world. OpenAI Dactyl (2019): a robotic hand learns to solve a Rubik's cube in simulation with domain randomization -> the policy works on the real hand without additional training on real data.
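A minimal sketch of the sampling step, assuming each episode draws fresh parameters uniformly around a nominal value. The `PhysicsParams` fields, the nominal values, and the lighting range are hypothetical; only the ±30% mass and ±50% friction ranges come from the text above. No real simulator API is used.

```python
import random
from dataclasses import dataclass

@dataclass
class PhysicsParams:
    mass_kg: float
    friction: float
    light_intensity: float

# Hypothetical nominal values for a small graspable object.
NOMINAL = PhysicsParams(mass_kg=0.2, friction=0.6, light_intensity=1.0)

def sample_params(rng, nominal=NOMINAL):
    """Draw one randomized environment around the nominal parameters."""
    return PhysicsParams(
        mass_kg=nominal.mass_kg * rng.uniform(0.7, 1.3),       # mass ±30%
        friction=nominal.friction * rng.uniform(0.5, 1.5),     # friction ±50%
        light_intensity=nominal.light_intensity * rng.uniform(0.5, 1.5),  # assumed range
    )

def training_episodes(n, seed=0):
    """Yield the parameters for n training episodes.
    A real pipeline would reset the simulator with each set and run the policy."""
    rng = random.Random(seed)
    for _ in range(n):
        yield sample_params(rng)

for params in training_episodes(3):
    print(params)
```

Seeding the generator keeps the randomized curriculum reproducible, which helps when comparing two policies on the same set of sampled environments.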
**NVIDIA Isaac Sim**: a photorealistic physics simulator on Omniverse. Ray-tracing for accurate lighting, PhysX 5 for deformable objects and soft robotics. Training speed: 10,000x real-time on 8xA100. Used by Boston Dynamics, ABB, and BMW to train robots before deploying to production lines. Reality gap with Isaac Sim is 30-50% smaller than with traditional simulators.
Why does Domain Randomization help with sim-to-real transfer?
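Key insight: if the policy succeeds in simulation across, for example, mass 0.5x-2x, friction 0.3x-1.5x, and varied lighting, then the real world, with its one specific set of parameters, is likely to fall inside the range already covered. This only holds when the real parameters actually lie within the randomized range. The policy does not 'know' reality - it has learned to cope with uncertainty.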
Foundation Models for Embodied AI: RT-2, PaLM-E and Beyond
2023. Google DeepMind publishes RT-2 (Robotic Transformer 2), one of the first vision-language-action models: it combines vision-language pre-training with robot actions. It is trained on web data (images + text) together with robot demonstrations. Result: the robot can follow commands that refer to objects absent from its demonstrations, via zero-shot transfer from internet knowledge. This is a fundamental shift: knowledge from the internet transfers to the physical world.
**RT-2 architecture**: a ViT image encoder + the PaLI-X vision-language backbone (55B parameters) + action tokenisation: each dimension of the continuous end-effector action (position and rotation deltas, gripper) is discretised into 256 bins and emitted as a token. Inference: camera image + instruction -> action token sequence. **PaLM-E (562B)**: an embodied language model with physical grounding. Conversation: 'where is the red cup?' -> physical movement -> 'I picked it up' -> verbal confirmation. Language and action share one latent space.
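Discretising a continuous action into 256 bins can be sketched as follows. This is an illustration of uniform binning in general, not RT-2's actual tokeniser: the value range [-1, 1], the seven-dimensional example action, and the function names are assumptions.

```python
NUM_BINS = 256

def tokenize(value, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a continuous value to a bin index in [0, num_bins - 1].
    Values outside [low, high] are clipped first."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * num_bins), num_bins - 1)

def detokenize(token, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a bin index back to the centre of its bin."""
    return low + (token + 0.5) * (high - low) / num_bins

# Hypothetical normalised action: three position deltas, three rotation deltas, gripper.
action = [0.30, -0.12, 0.05, 0.0, 0.0, -0.75, 1.0]
tokens = [tokenize(v) for v in action]
recovered = [detokenize(t) for t in tokens]
max_error = max(abs(a - r) for a, r in zip(action, recovered))
print(tokens)
print(max_error)
```

The round trip loses at most half a bin width, here 1/256 of the normalised range. Once actions are integers in a fixed vocabulary, the language model can be trained to emit them with the same next-token objective it uses for text.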
**OpenVLA (2024)**: an open-source RT-2 alternative built on Llama 2 7B. Trained on roughly 970,000 demonstrations from the Open X-Embodiment dataset, which spans 22 different robot types. Any researcher can fine-tune it on their own robot in 8 hours on a single A100. Democratisation of embodied AI: what cost USD 1M in 2022 costs USD 100 in 2024.
Common misconception: imitation learning is always better than RL for real robots because it is safe and requires no reward function.
In fact, IL is better for short-horizon tasks where demonstrations are available; RL is better for long-horizon tasks, for optimising performance beyond the demonstrator's level, and for tasks where demonstrations cannot be collected.
Behavioral Cloning: 100 demonstrations -> 60% success rate on simple pick-and-place. RL in simulation with domain randomization: 0 demonstrations -> 85% success rate on complex dexterous manipulation (OpenAI Dactyl). For household tasks IL is simpler. For complex physical tasks RL in simulation is more powerful.
RT-2 is trained on web data (images + text) and shows zero-shot transfer to unseen objects. Why is this possible?
RT-2 has seen billions of images with descriptions on the internet. It knows what a 'red mug' is, how it looks from different angles and in different lighting. Robot arm demonstrations additionally teach it how to manipulate. Together: visual-semantic knowledge + manipulation knowledge = zero-shot manipulation of new objects.
Related Topics
Embodied AI combines RL, NLP and computer vision in a physical context.
- Reinforcement Learning for Robots — RL is the foundational method that imitation learning complements and extends
- Robotics System Architecture — Architecture of the deployment environment for embodied AI policies
Key Ideas
- Behavioral Cloning: supervised learning state -> action; distributional shift problem on deviation
- DAgger: iterative addition of expert corrections in difficult states to the dataset
- Domain Randomization: diverse physical parameters in simulation = robustness in reality
- RT-2/PaLM-E: foundation models with action tokens - web knowledge + robot demonstrations in one model
- Sim-to-real gap: domain randomization + simulator fidelity improvement (Isaac Sim) reduces the gap
Questions for Reflection
- How does one choose between Behavioral Cloning and RL for a specific task? What task parameters drive the decision?
- RT-2 uses 55B parameters to control a robot. How can this be deployed on a mobile robot with limited compute?
- Domain Randomization randomises physics. But some parameters cannot be randomised (unique physics of soft materials). How should one approach such cases?
Related Lessons
- rob-12 — RL for robots is the foundation for understanding imitation learning and RLHF
- rob-15 — Robotics system architecture is the deployment context for embodied AI
- rec-16 — Foundation models in RecSys and Embodied AI: the same paradigm in different domains
- rts-16 — RT stack of autonomous systems - similar requirements to embodied AI