Robotics

Embodied AI: Robot Learning through Imitation and Interaction

2024. Figure 01 is a humanoid robot from the startup Figure AI, built in partnership with OpenAI. Demo: the natural-language request 'give me something to eat' -> scene analysis -> select the apple -> hand it to the human -> verbal confirmation. Scene understanding, reasoning, and speech all run through one foundation model. Investment raised that year: USD 675 million. This is not the future - this is 2024.

  • Stanford Mobile ALOHA: household tasks from 50-100 demonstrations, 60-80% success rate
  • OpenAI Dactyl: Rubik's cube via RL + domain randomization in simulation
  • RT-2 (Google DeepMind): zero-shot manipulation via a 55B vision-language model

Imitation Learning: Learning from Demonstrations

Classic RL for a manipulator: reward correct grasping. But how to define the reward? Too light a grip - the object falls. Too tight - it deforms. Every object needs a different force. Writing a reward function covering all this variety takes years of engineering. Imitation Learning bypasses this: simply show the robot the right way.

**Behavioral Cloning (BC)**: supervised learning on expert demonstrations. A neural network maps state -> action. Problem: **distributional shift** - when the robot makes an error, it enters a state that is not in the training set and does not know how to act. **DAgger (Dataset Aggregation)**: an iterative fix. Step 1: train a policy on the demonstrations. Step 2: run the policy and have an expert label the states it visits with the correct actions. Step 3: add these corrections to the dataset and retrain. Repeat. After N iterations the dataset also covers the states the policy reaches through its own errors.
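The three DAgger steps can be sketched on a toy problem. This is a minimal illustration, not a robot-ready implementation: the 1-D state, the linear expert `expert_action`, the toy dynamics, and all constants are assumptions made up for the example. Because the toy expert is linear, plain BC already recovers it; the point is only to show how data flows through the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(state):
    # Hypothetical expert: push the 1-D state back toward zero.
    return -0.8 * state

def rollout(policy_weight, horizon=20):
    # Run a linear policy (action = weight * state) and record visited states.
    state = rng.normal(0.0, 1.0)
    visited = []
    for _ in range(horizon):
        visited.append(state)
        action = policy_weight * state
        state = state + action + rng.normal(0.0, 0.05)  # toy dynamics
    return np.array(visited)

def fit_policy(states, actions):
    # Behavioral cloning step: least-squares fit of action = weight * state.
    return float(states @ actions / (states @ states))

def dagger(iterations=5, horizon=20):
    # Step 1: train on expert demonstrations (the expert drives the rollout).
    states = rollout(-0.8, horizon)
    actions = expert_action(states)
    weight = fit_policy(states, actions)
    for _ in range(iterations):
        # Step 2: the learner drives; the expert labels the states it visits.
        new_states = rollout(weight, horizon)
        new_actions = expert_action(new_states)
        # Step 3: aggregate the corrections into the dataset and retrain.
        states = np.concatenate([states, new_states])
        actions = np.concatenate([actions, new_actions])
        weight = fit_policy(states, actions)
    return weight, len(states)

weight, dataset_size = dagger()
print(weight, dataset_size)
```

The dataset grows by one rollout per iteration, so the final policy is fit on states from both the expert's and the learner's own trajectories.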

**Stanford Mobile ALOHA (2024)**: a household robot trained to wash dishes, cook food, and clean through 50-100 teleoperation demonstrations. Architecture: Action Chunking with Transformers (ACT) - predicts a chunk of K actions in a single forward pass. Training takes 4 hours on one GPU. Task success rate: 60-80% with 100 demonstrations versus 20-30% for classic RL.
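A rough sketch of what action chunking means for the control loop: the policy is queried once per chunk and the K predicted actions are then executed in order. The function `predict_chunk`, the chunk size, the action dimension, and the toy dynamics are hypothetical stand-ins, not the ACT model itself.

```python
import numpy as np

CHUNK_SIZE = 4   # K, assumed for illustration
ACTION_DIM = 2   # assumed for illustration

def predict_chunk(observation):
    # Stand-in for one forward pass of the policy (hypothetical):
    # returns K actions at once, shape (CHUNK_SIZE, ACTION_DIM).
    return np.tile(-0.1 * observation, (CHUNK_SIZE, 1))

def run_episode(initial_obs, total_steps=12):
    obs = np.asarray(initial_obs, dtype=float)
    forward_passes = 0
    executed = []
    for step in range(total_steps):
        if step % CHUNK_SIZE == 0:
            # Query the policy only at the start of each chunk.
            chunk = predict_chunk(obs)
            forward_passes += 1
        action = chunk[step % CHUNK_SIZE]
        executed.append(action)
        obs = obs + action  # toy dynamics
    return np.array(executed), forward_passes

actions, passes = run_episode([1.0, -2.0])
print(actions.shape, passes)
```

With 12 control steps and K = 4, the policy runs 3 times instead of 12, which is the practical appeal of predicting chunks.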

Behavioral Cloning is trained on expert demonstrations. Why does its performance drop sharply when deviating from learned trajectories?

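BC assumes its training states are i.i.d. samples from the expert's state distribution. At run time, however, each state depends on the policy's own previous actions. A small action error places the robot in a state not in the training data. BC performs poorly there -> larger error -> the state moves further from the training distribution. DAgger addresses this by adding the states reached through such errors, labelled with expert actions, to the dataset.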
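The compounding effect can be made concrete with a simplified worst-case argument. Assume (for illustration only) a horizon of T steps, a policy that errs with probability at most ε on states from the expert's distribution, and a cost of at most 1 for every step from the first error onward. Summing over the step t at which the first error occurs:

```latex
\mathbb{E}[\text{cost}]
  \;\le\; \sum_{t=1}^{T} \epsilon \,(T - t + 1)
  \;=\; \epsilon \,\frac{T(T+1)}{2}
  \;=\; O(\epsilon T^{2})
```

Under these assumptions the cost grows quadratically with the horizon. If the policy has also been trained on the states it reaches after a mistake, so that each mistake costs only a bounded amount before recovery, the same counting gives a cost on the order of εT.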

Sim-to-Real Transfer: From Simulator to the Real World

Collecting real demonstrations is expensive: one hour of manipulator teleoperation costs USD 20-50 in operator time. Thousands of hours for complex tasks - hundreds of thousands of dollars. A simulator is free. But a policy trained in simulation often does not work in reality: **reality gap** - differences in physics, textures, lighting, contact forces. Sim-to-real transfer is the engineering task of narrowing this gap.

**Domain Randomization**: during simulator training, randomly vary the physical parameters - object mass ±30%, friction coefficient ±50%, lighting, textures. The policy is forced to work across the whole range of parameters -> it becomes robust to the real world. OpenAI Dactyl (2019): a robotic hand learns to solve a Rubik's cube entirely in simulation with domain randomization -> the policy works on the physical hand without additional training.
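A minimal sketch of the sampling side of domain randomization, using the mass and friction ranges quoted above. The `SimParams` container, the nominal values, and the lighting range are assumptions for illustration; a real setup would pass the sampled values to the simulator at the start of each episode.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    mass_kg: float
    friction: float
    light_intensity: float

# Nominal values are made up for the example.
NOMINAL = SimParams(mass_kg=0.2, friction=0.6, light_intensity=1.0)

def sample_params(rng):
    # Draw one randomized set of physical parameters for an episode.
    return SimParams(
        mass_kg=NOMINAL.mass_kg * rng.uniform(0.7, 1.3),    # mass +/-30%
        friction=NOMINAL.friction * rng.uniform(0.5, 1.5),  # friction +/-50%
        light_intensity=NOMINAL.light_intensity * rng.uniform(0.5, 1.5),  # assumed range
    )

rng = random.Random(0)
episodes = [sample_params(rng) for _ in range(1000)]
print(episodes[0])
```

Each training episode then runs under a different draw, so the policy never sees the same physics twice.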

**NVIDIA Isaac Sim**: a photorealistic physics simulator on Omniverse. Ray-tracing for accurate lighting, PhysX 5 for deformable objects and soft robotics. Training speed: 10,000x real-time on 8xA100. Used by Boston Dynamics, ABB, and BMW to train robots before deploying to production lines. Reality gap with Isaac Sim is 30-50% smaller than with traditional simulators.

Why does Domain Randomization help with sim-to-real transfer?

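Key insight: if the policy succeeds in the simulator across the whole randomized range (for example mass 0.7x-1.3x and friction 0.5x-1.5x of nominal, under any lighting), the real world with its specific parameters falls inside this already covered space, provided the ranges are wide enough to contain the true values. The policy does not 'know' reality - it knows how to act under uncertainty.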

Foundation Models for Embodied AI: RT-2, PaLM-E and Beyond

2023. Google DeepMind publishes RT-2 (Robotic Transformer 2), one of the first models to combine vision-language pre-training with robot actions. It is trained on web data (images + text) plus robot demonstrations. Result: the robot can follow commands involving objects that never appeared in the demonstrations, via zero-shot transfer from internet knowledge. This is a fundamental shift: knowledge from the internet transfers to the physical world.

**RT-2 architecture**: a vision-language backbone (PaLI-X, 55B parameters, with a ViT image encoder) + an action tokeniser that discretises each dimension of the robot action (end-effector motion and gripper state) into 256 bins. Inference: camera image + instruction -> action token sequence. **PaLM-E (562B)**: an embodied language model with physical grounding. Conversation: 'where is the red cup?' -> physical movement -> 'I picked it up' -> verbal confirmation. Language and action in one latent space.
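The idea of turning continuous action values into 256 discrete tokens can be sketched with uniform binning. This is an illustrative assumption about how such a tokeniser might work: the bounds `low` and `high`, the three-dimensional action, and the function names are hypothetical and not taken from RT-2.

```python
import numpy as np

NUM_BINS = 256

def tokenize(action, low, high):
    # Map each continuous dimension to an integer bin index in [0, 255].
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize(tokens, low, high):
    # Map each bin index back to the centre of its bin.
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

# Assumed per-dimension bounds for a 3-dimensional action.
low = np.array([-1.0, -1.0, 0.0])
high = np.array([1.0, 1.0, 1.0])

action = np.array([0.25, -1.0, 1.0])
tokens = tokenize(action, low, high)
recovered = detokenize(tokens, low, high)
print(tokens, recovered)
```

The round trip loses at most half a bin width per dimension, which is the price paid for letting a language model emit actions as ordinary tokens.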

**OpenVLA (2024)**: an open-source RT-2 alternative built on Llama 2 7B. Fine-tuned on roughly 970,000 demonstrations from the Open X-Embodiment dataset, which spans 22 different robot types. Any researcher can fine-tune on their own robot in 8 hours on a single A100. Democratisation of embodied AI: what cost USD 1M in 2022 costs USD 100 in 2024.

Myth: imitation learning is always better than RL for real robots, because it is safe and requires no reward function.

Reality: IL is better for short-horizon tasks where demonstrations are available; RL is better for long-horizon tasks, for optimising performance beyond the demonstrator's level, and for tasks where demonstrations cannot be collected.

Behavioral Cloning: 100 demonstrations -> 60% success rate on simple pick-and-place. RL in simulation with domain randomization: 0 demonstrations -> 85% success rate on complex dexterous manipulation (OpenAI Dactyl). For household tasks IL is simpler. For complex physical tasks RL in simulation is more powerful.

RT-2 is trained on web data (images + text) and shows zero-shot transfer to unseen objects. Why is this possible?

RT-2's vision-language backbone has been pre-trained on billions of captioned images from the internet. It knows what a 'red mug' is and how it looks from different angles and in different lighting. Robot arm demonstrations additionally teach it how to manipulate. Together: visual-semantic knowledge + manipulation knowledge = zero-shot manipulation of new objects.

Related Topics

Embodied AI combines RL, NLP and computer vision in a physical context.

  • Reinforcement Learning for Robots — RL is the foundational method that imitation learning complements and extends
  • Robotics System Architecture — Architecture of the deployment environment for embodied AI policies

Key Ideas

  • Behavioral Cloning: supervised learning state -> action; distributional shift problem on deviation
  • DAgger: iterative addition of expert corrections in difficult states to the dataset
  • Domain Randomization: diverse physical parameters in simulation = robustness in reality
  • RT-2/PaLM-E: foundation models with action tokens - web knowledge + robot demonstrations in one model
  • Sim-to-real gap: domain randomization + simulator fidelity improvement (Isaac Sim) reduces the gap

Questions for Reflection

  • How does one choose between Behavioral Cloning and RL for a specific task? What task parameters drive the decision?
  • RT-2 uses 55B parameters to control a robot. How can this be deployed on a mobile robot with limited compute?
  • Domain Randomization randomises physics. But some parameters cannot be randomised (unique physics of soft materials). How should one approach such cases?

Related Lessons

  • rob-12 — RL for robots is the foundation for understanding imitation learning and RLHF
  • rob-15 — Robotics system architecture is the deployment context for embodied AI
  • rec-16 — Foundation models in RecSys and Embodied AI: the same paradigm in different domains
  • rts-16 — RT stack of autonomous systems - similar requirements to embodied AI