What Is Imitation Learning?

Imitation learning trains a robot policy by showing it examples of desired behavior. Instead of manually programming trajectories or engineering reward functions for reinforcement learning, you demonstrate the task through teleoperation — and the robot learns to replicate your behavior from the demonstration data.

The core idea is straightforward: collect a dataset of (observation, action) pairs from human demonstrations, then train a neural network to predict the correct action given the current observation. The trained network becomes the robot's policy, running in real time to control the robot autonomously.

In practice, imitation learning is the most reliable method for getting a robot to perform a new manipulation task in 2026. Reinforcement learning requires millions of environment interactions and carefully shaped reward functions. Classical motion planning requires explicit geometric models of every object. Imitation learning requires only demonstrations — and with modern architectures, surprisingly few of them.

Brief History: From Behavioral Cloning to Modern Architectures

Behavioral cloning (BC) is the simplest form: supervised learning that directly maps observations to actions. Train a neural network on (image, joint_position) → (next_action) pairs using standard regression loss. BC works well when demonstration data is consistent and covers the distribution of states the robot encounters during deployment. It fails when the robot drifts off the demonstrated trajectory — a problem called compounding error or distribution shift.
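
As a concrete toy illustration of the BC objective, the sketch below fits a linear policy to synthetic (observation, action) pairs by gradient descent on an MSE loss. The dataset, dimensions, and learning rate are all invented for the example; a real pipeline would use teleoperation logs and a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demonstration data: observations and the expert's actions.
# In practice these come from teleoperation recordings.
obs = rng.standard_normal((256, 8))      # 8-D observation (e.g. joint positions)
expert_W = rng.standard_normal((8, 4))   # hypothetical linear "expert"
acts = obs @ expert_W                    # 4-D demonstrated actions

# Behavioral cloning: minimize MSE between predicted and demonstrated actions.
W = np.zeros((8, 4))                     # linear policy parameters
lr = 0.1
for _ in range(300):
    pred = obs @ W
    W -= lr * obs.T @ (pred - acts) / len(obs)   # gradient of the MSE loss

mse = float(np.mean((obs @ W - acts) ** 2))      # approaches zero
```

The same supervised structure carries over unchanged when the linear map is replaced by a vision-conditioned network.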

DAgger (Dataset Aggregation), introduced by Ross et al. in 2011, addresses distribution shift by iteratively collecting new demonstrations at the states the robot actually visits. After training an initial BC policy, you deploy it, let it drift, and have a human expert label the visited states with the correct actions. These corrections are added to the training set. DAgger is theoretically elegant but operationally expensive — it requires a human expert available during every training iteration.
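
The DAgger loop can be sketched in a few lines. Everything here (the 1-D toy system, the linear expert, the least-squares policy fit) is invented purely to show the structure; because the toy expert is exactly linear, the fit recovers it immediately, and the point is the loop itself: deploy, label the visited states, aggregate, refit.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(state):
    return -0.5 * state                  # hypothetical expert: drive state to 0

def rollout(gain, steps=20):
    states, s = [], 1.0
    for _ in range(steps):
        states.append(s)
        s = s + gain * s + 0.05 * rng.standard_normal()  # noisy 1-D dynamics
    return np.array(states)

# Initial dataset from expert demonstrations (narrow state coverage)
X = rng.uniform(0.5, 1.5, 50)
y = expert_action(X)

gain = 0.0
for _ in range(5):                       # DAgger iterations
    gain = float(X @ y / (X @ X))        # least-squares policy fit
    visited = rollout(gain)              # deploy the policy, let it drift
    X = np.concatenate([X, visited])     # expert labels the visited states,
    y = np.concatenate([y, expert_action(visited)])  # aggregated into the set
```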

GAIL (Generative Adversarial Imitation Learning), introduced by Ho and Ermon in 2016, combines imitation learning with adversarial training: a discriminator learns to distinguish robot behavior from human behavior, and the policy is trained to fool the discriminator. GAIL handles multi-modal demonstrations better than BC but requires online environment interaction (typically in simulation) and is notoriously difficult to tune. It sees limited use in real-robot settings today.

The three approaches that dominate real-world robot manipulation in 2026 — ACT, Diffusion Policy, and VLA models — all build on the behavioral cloning paradigm but add architectural innovations that address its core limitations.

Why Imitation Learning Won for Manipulation

Before 2023, most robotics labs assumed reinforcement learning would eventually solve manipulation. That expectation has not materialized for contact-rich tasks. Here is why imitation learning became the default:

  • Sample efficiency: A useful imitation learning policy can be trained on 50–500 human demonstrations collected in a single day. RL for the same task requires millions of environment steps, taking weeks of simulation or months on real hardware.
  • No reward engineering: RL requires a reward function that captures exactly what "success" means. For manipulation tasks like folding a garment or assembling components, defining reward is enormously difficult. Imitation learning sidesteps the problem entirely — the human demonstrations implicitly define the objective.
  • Teleoperation data is natural: Modern teleoperation hardware (leader-follower arms, VR controllers, exoskeleton gloves) makes it straightforward to collect high-quality demonstrations. The data collection process maps directly to the policy input format.
  • Real-world compatibility: RL trains best in simulation, but sim-to-real transfer for contact-rich manipulation remains unreliable. Imitation learning trains directly on real-world data, eliminating the reality gap for the training phase.
  • Composability with foundation models: Pre-trained vision-language models (SigLIP, DINOv2) provide powerful visual representations that imitation learning can leverage directly. RL cannot easily incorporate pre-trained representations because the reward signal must flow through the entire network.

RL still dominates for locomotion (Unitree's G1, Boston Dynamics' Atlas) and for tasks with clear reward signals (game playing, navigation). For manipulation — picking, placing, inserting, folding, pouring, assembling — imitation learning is the practical choice.

ACT: Action Chunking with Transformers

ACT, introduced by Tony Zhao et al. at Stanford in 2023, is the workhorse of imitation learning for tabletop manipulation. It predicts a chunk (sequence) of future actions rather than a single next action, which directly addresses the compounding error problem in behavioral cloning.

How ACT Works

ACT uses a CVAE (Conditional Variational Autoencoder) architecture with a Transformer backbone. The encoder processes the current observation — camera images passed through a ResNet or ViT visual encoder, plus joint positions — and encodes it into a latent representation. The decoder takes this latent code and generates a sequence of k future actions (typically k=100, representing 2 seconds at 50 Hz control).

The key insight is temporal smoothing through chunking: by predicting 100 future actions at once, the model must produce a coherent trajectory, not just the immediate next action. This acts as an implicit regularizer that produces smooth, consistent motions even when individual demonstration trajectories vary slightly.

During deployment, ACT uses temporal ensembling: at each timestep, it generates a new 100-step action chunk, but the executed action is a weighted average of the current chunk and previous chunks' predicted actions for this timestep. This further smooths the trajectory and reduces jitter.
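
A minimal sketch of temporal ensembling over a buffer of past chunks. The exponential weighting with the oldest prediction weighted highest follows the ACT paper's scheme; the buffer layout and m=0.01 default are illustrative choices.

```python
import numpy as np

def ensembled_action(chunk_buffer, t, m=0.01):
    """chunk_buffer: list of (start_step, actions) in chronological order,
    where actions has shape (k, action_dim). Returns the action for step t
    as a weighted average over every chunk that predicted step t.
    Assumes at least one chunk covers step t."""
    preds = []
    for start, actions in chunk_buffer:        # oldest chunk first
        idx = t - start
        if 0 <= idx < len(actions):
            preds.append(actions[idx])
    w = np.exp(-m * np.arange(len(preds)))     # w_0 -> oldest prediction
    w /= w.sum()
    return (w[:, None] * np.array(preds)).sum(axis=0)
```

With m=0 this reduces to a plain average; larger m commits more strongly to the earlier plan, which is what smooths out jitter between successive chunks.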

When to Use ACT

  • Tabletop manipulation with 1–2 robot arms (the original ALOHA use case)
  • Tasks with relatively deterministic strategies (pick-place, insertion, assembly)
  • When you have 50–200 demonstrations and need a fast training pipeline
  • When you need to deploy on modest hardware (runs on a single consumer GPU, ~15 ms inference)

Typical Data Requirements for ACT

  • Simple pick-and-place: 50 demonstrations → 85%+ success
  • Bimanual coordination: 100 demonstrations → 80%+ success
  • Precise insertion (peg-in-hole, USB): 200 demonstrations → 70%+ success
  • Complex multi-step tasks: 400–800 demonstrations needed

Quality matters more than quantity. 50 clean, consistent demonstrations often outperform 500 noisy ones.

Training ACT with LeRobot

The most accessible way to train ACT in 2026 is through Hugging Face's LeRobot framework. After converting your HDF5 demonstration data to LeRobot format:

# Install LeRobot
pip install lerobot

# Convert your dataset
python -m lerobot.scripts.convert_dataset \
  --raw-dir ./my_episodes/ \
  --repo-id my-org/my-task \
  --raw-format hdf5_aloha

# Train ACT policy
python -m lerobot.scripts.train \
  policy=act \
  dataset_repo_id=my-org/my-task \
  training.num_epochs=2000 \
  training.batch_size=8

Training typically converges in 2–8 hours on a single RTX 4090 for datasets of 100–200 episodes.

ACT: Pros and Cons

Pros:
  • Extremely data-efficient (50 demos for simple tasks)
  • Fast inference (50+ Hz on consumer GPU)
  • Smooth trajectories via temporal ensembling
  • Well-supported in the LeRobot framework
  • Low training compute (single GPU, hours)

Cons:
  • Struggles with multi-modal action distributions
  • No language conditioning
  • Sensitive to demonstration inconsistency
  • Single-task only (one policy per task)
  • Temporal ensembling weight requires manual tuning

Diffusion Policy

Diffusion Policy, introduced by Cheng Chi et al. at Columbia in 2023, applies the denoising diffusion framework — the same approach behind image generation models like Stable Diffusion — to robot action prediction. It has become the go-to choice for tasks where multiple valid strategies exist in the demonstration data.

How Diffusion Policy Works

Instead of directly predicting actions, Diffusion Policy learns to iteratively denoise a random noise vector into a clean action trajectory. During training, it takes a clean action trajectory from a demonstration, adds varying levels of Gaussian noise, and trains a network (typically a 1D U-Net or Transformer) to predict the added noise. During inference, it starts with pure noise and iteratively denoises it over K steps (typically K=10–50 using DDIM scheduling) to produce a clean action trajectory.
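
The noise-prediction training objective can be sketched as follows. The "network" here is a placeholder linear map and the linear noise schedule is purely illustrative; a real implementation conditions the model on the observation and diffusion timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, act_dim, T = 16, 7, 100
alphas_cumprod = np.linspace(0.999, 0.01, T)     # toy noise schedule

W = 0.1 * rng.standard_normal((act_dim, act_dim))
def predict_noise(noisy_traj, t):
    # Stand-in for the 1D U-Net / Transformer. A real model also takes
    # the observation encoding and the timestep t as conditioning inputs.
    return noisy_traj @ W

def training_loss(clean_traj):
    t = rng.integers(T)                           # random diffusion timestep
    eps = rng.standard_normal(clean_traj.shape)   # noise the net must recover
    a = alphas_cumprod[t]
    noisy = np.sqrt(a) * clean_traj + np.sqrt(1 - a) * eps   # forward process
    return float(np.mean((predict_noise(noisy, t) - eps) ** 2))  # MSE on noise

loss = training_loss(rng.standard_normal((horizon, act_dim)))
```

Inference runs this in reverse: start from pure noise and apply the trained denoiser for K steps to produce the action trajectory.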

The critical advantage is multi-modality: diffusion models naturally represent multiple valid solutions. If there are two equally good ways to grasp an object (approach from the left vs. the right), Diffusion Policy can represent both modes without averaging them into a single incorrect middle trajectory. Standard behavioral cloning with MSE loss averages modes, which is catastrophic for bimodal tasks.
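
The mode-averaging failure is easy to see numerically: if half the demonstrations approach from the left (action -1) and half from the right (+1), the single prediction that minimizes MSE is their mean, which matches neither demonstration.

```python
import numpy as np

# Demonstrated actions for the same observation: two equally valid modes.
actions = np.array([-1.0] * 100 + [1.0] * 100)

# The constant prediction minimizing MSE is the mean of the data --
# a "middle" trajectory that no demonstration actually contains.
mse_optimal = actions.mean()    # 0.0
```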

Diffusion Policy also produces exceptionally smooth trajectories because the denoising process inherently generates temporally consistent action sequences. This makes it well-suited for contact-rich tasks like wiping, pouring, and surface following where trajectory smoothness directly affects success.

When to Use Diffusion Policy

  • Tasks with multiple valid strategies (data contains different approaches to the same goal)
  • Contact-rich tasks where precise force trajectories matter
  • Tasks requiring high trajectory smoothness (pouring, drawing, surface wiping)
  • When you have data from multiple operators with slightly different styles
  • When you have a GPU with enough compute for iterative denoising (30–80 ms inference)

Typical Data Requirements for Diffusion Policy

  • Simple unimodal tasks: 100–200 demonstrations
  • Multi-modal tasks (multiple valid strategies): 200–500 demonstrations
  • Complex contact-rich manipulation: 500–1,000 demonstrations

Key difference from ACT: Diffusion Policy is less sensitive to demonstration inconsistency because it can represent multimodal distributions. This makes it more forgiving of data collected by multiple operators.

Diffusion Policy: Pros and Cons

Pros:
  • Handles multi-modal action distributions natively
  • Exceptionally smooth trajectories
  • Robust to operator variation in demonstrations
  • Strong performance on contact-rich tasks
  • Supported in LeRobot and robomimic

Cons:
  • Slower inference than ACT (10-30 Hz)
  • Needs more data than ACT for simple tasks
  • More hyperparameters (diffusion steps, noise schedule)
  • No language conditioning (without extensions)
  • Higher GPU memory requirements during training

OpenVLA and Vision-Language-Action Models

Vision-Language-Action (VLA) models represent the frontier of robot learning in 2026. They combine a pre-trained vision-language model with action prediction, enabling robots to follow natural language instructions ("pick up the red cup and place it on the tray") and generalize across tasks and objects they have never seen during fine-tuning.

What Changes with Foundation Models

Traditional imitation learning (ACT, Diffusion Policy) trains task-specific policies from scratch. Each new task requires a new dataset and a new training run. VLA models change this equation by providing a pre-trained representation that already understands visual scenes and language — you fine-tune on your specific robot and task rather than training from zero.

The practical consequence: VLA models can achieve reasonable performance with fewer task-specific demonstrations because the heavy lifting of visual understanding and language grounding was done during pre-training on internet-scale data and large cross-embodiment robot datasets like Open X-Embodiment (970K+ episodes, 22+ robot types).

Key VLA Models

  • OpenVLA (Berkeley, 2024): Open-source 7B parameter VLA built on Llama 2 + SigLIP. Trained on Open X-Embodiment data. Fine-tunes to new robots with 100–500 demonstrations. Runs at ~5 Hz on a single A100 GPU.
  • pi0 (Physical Intelligence, 2024–2025): Proprietary VLA trained on cross-embodiment data including dexterous hands. Demonstrated zero-shot transfer to unseen tasks. Available through Physical Intelligence's API.
  • Octo (Berkeley, 2024): Lightweight 93M parameter generalist policy. Designed for fast fine-tuning with 50–200 demonstrations. Open-source, runs on a single consumer GPU.
  • RT-2 (Google DeepMind, 2023): 55B parameter VLA based on PaLI-X (a smaller 12B variant builds on PaLM-E). State-of-the-art generalization but requires massive compute (8 TPU v5e chips). Not publicly available.

Fine-Tuning vs. Training from Scratch

Fine-tuning a pre-trained VLA (the practical path): Start from an OpenVLA or Octo checkpoint. Collect 100–500 demonstrations on your robot. Fine-tune using LoRA (for 7B+ models) or full fine-tuning (for Octo-scale models). Training time: 4–24 hours on 1–4 GPUs. This leverages the pre-trained visual and language representations and is the recommended approach for most teams.

Training a VLA from scratch (for well-funded labs only): Requires 10,000–100,000+ demonstrations across diverse robots and tasks, plus 100–1,000+ GPU-hours. Only justified if you are building a foundation model for a new robot embodiment class or need capabilities not covered by existing VLAs.

VLA Models: Pros and Cons

Pros:
  • Language-conditioned task execution
  • Multi-task from a single model
  • Strong generalization to new objects
  • Leverages internet-scale pre-training
  • Cross-embodiment transfer potential

Cons:
  • Slow inference (3-10 Hz)
  • Requires A100-class GPU for 7B+ models
  • Fine-tuning is more complex than ACT/DP training
  • Larger models are sensitive to prompt formatting
  • Still maturing — fewer deployment success stories than ACT/DP

Algorithm Comparison Table

Algorithm | Data Efficiency | Generalization | Compute (Train) | Compute (Infer) | Ease of Training | Best For
ACT | Very high (50-200) | Low (single task) | 1 GPU, 2-8 hrs | 50+ Hz, 1 GPU | Easy | Single-task tabletop, fast iteration
Diffusion Policy | High (100-500) | Medium (multi-strategy) | 1 GPU, 4-16 hrs | 10-30 Hz, 1 GPU | Moderate | Multi-modal tasks, contact-rich
Behavioral Cloning | High (50-200) | Very low | 1 GPU, 1-4 hrs | 100+ Hz | Very easy | Prototyping, baselines
GAIL | Medium (100-1000) | Medium | Needs sim, days | 50+ Hz | Difficult | Sim-only research, locomotion
OpenVLA / Octo | High (50-500 fine-tune) | High (cross-task) | 1-4 A100s, 4-24 hrs | 3-10 Hz, 1 A100 | Moderate-Hard | Multi-task, language-conditioned

The Training Pipeline: 5 Stages

Regardless of which algorithm you choose, the training pipeline follows these stages. The details at each stage matter enormously for final policy performance.

Stage 1: Data Collection

Collect human demonstrations via teleoperation. Use leader-follower arms for the highest quality data, or VR controllers for faster throughput on gross-manipulation tasks. Record at 50 Hz for joint data and 30 fps for cameras. Store in HDF5 format. See our data collection guide for the full protocol.

Critical decisions at this stage: number of demonstrations (see algorithm-specific recommendations above), operator consistency (1–3 operators for bootstrap), start-state randomization (vary object positions systematically), and camera setup (minimum: 1 wrist + 1 overhead camera).

Stage 2: Data Formatting and Preprocessing

Convert raw recordings to training-ready format:

  • Image resizing: Resize camera frames to the model's expected input resolution (typically 224x224 for ViT-based encoders, 256x256 for ResNet). Use anti-aliased resizing (Lanczos or bicubic) to preserve fine detail.
  • Action normalization: Normalize actions using statistics computed from the training set, either min-max scaling to the [-1, 1] range or z-scoring with the per-dimension mean and standard deviation. Apply identical normalization at deployment — a mismatch here causes silent failure.
  • Episode filtering: Remove failed episodes, episodes with anomalous duration (>3 standard deviations from mean), and episodes flagged during quality review.
  • Train/validation split: Hold out 10–15% of episodes for validation. Split by episode, never by frame — putting frames from the same episode in both train and validation sets creates data leakage.
  • Format conversion: For LeRobot training, convert HDF5 to Parquet using lerobot.scripts.convert_dataset. For robomimic, use their native HDF5 loader. See our data format guide for details.
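
To make the "identical normalization at deployment" requirement concrete, here is a minimal sketch that computes z-score statistics once, saves them alongside the checkpoint, and reuses them verbatim at inference time. The file name and stats layout are illustrative.

```python
import json
import numpy as np

def fit_stats(actions):
    """Compute normalization statistics from the TRAINING set only."""
    return {"mean": actions.mean(axis=0).tolist(),
            "std": (actions.std(axis=0) + 1e-8).tolist()}

def normalize(a, stats):
    return (a - np.array(stats["mean"])) / np.array(stats["std"])

def denormalize(a, stats):
    return a * np.array(stats["std"]) + np.array(stats["mean"])

actions = np.random.default_rng(0).normal(3.0, 2.0, size=(1000, 6))
stats = fit_stats(actions)

# Persist next to the checkpoint; deployment must load THIS file,
# never recompute statistics from deployment data.
with open("action_stats.json", "w") as f:
    json.dump(stats, f)

roundtrip = denormalize(normalize(actions, stats), stats)
```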

Stage 3: Training

Typical training configurations by algorithm:

  • ACT: 2,000–5,000 epochs, learning rate 1e-5 with cosine schedule, batch size 8–16. Training time: 2–8 hours on a single RTX 4090.
  • Diffusion Policy: 500–2,000 epochs, learning rate 1e-4 with cosine schedule. DDPM with 100 diffusion steps for training, DDIM with 10–50 steps for inference. Training time: 4–16 hours on a single RTX 4090.
  • VLA fine-tuning: 20–100 epochs, learning rate 2e-5. LoRA (rank 16–64) for 7B+ models or full fine-tuning for sub-1B models. Training time: 4–24 hours on 1–4 A100 GPUs.

Monitor training loss and validation loss. If validation loss diverges from training loss after a few hundred epochs, you are overfitting — reduce epoch count or add data augmentation.

Stage 4: Evaluation

Validation loss is a weak predictor of real-world performance. The only reliable evaluation is deploying the policy on the real robot and measuring task success rate. Run at least 20 evaluation trials to get a statistically meaningful result (95% confidence interval is approximately ±20% with 20 trials, ±10% with 50 trials). Record every evaluation episode for failure analysis.
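
The interval widths quoted above follow from the normal approximation to the binomial; a quick sketch:

```python
import math

def success_ci(successes, trials, z=1.96):
    """95% confidence half-width for a measured success rate
    (normal approximation; prefer Wilson intervals near 0% or 100%)."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, half

p, half = success_ci(10, 20)   # 0.5 +/- ~0.22
```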

Evaluate in two conditions: (1) nominal — same object positions, lighting, and camera angles as training data; (2) perturbed — randomized positions, varied lighting, or unseen object instances. The gap between nominal and perturbed success rates reveals your policy's generalization boundary.

Stage 5: Deployment

Export the trained model to ONNX or TorchScript for deployment. Run inference on the robot's onboard GPU (NVIDIA Jetson Orin for embedded, RTX 4060+ for workstation). Monitor inference latency — if the policy cannot run at the robot's control frequency, use action chunking to predict multiple future actions per inference call.
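
A back-of-envelope sketch (not a scheduler) for sizing the chunk in the latency-vs-control-rate tradeoff described above:

```python
import math

def actions_per_inference(inference_ms, control_hz):
    """Minimum number of future actions each inference call must supply
    so the control loop never waits on the policy."""
    control_period_ms = 1000.0 / control_hz
    return max(1, math.ceil(inference_ms / control_period_ms))

# e.g. a diffusion policy with 80 ms inference driving a 50 Hz arm
n = actions_per_inference(80, 50)   # 4 actions per call
```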

Log every deployment episode for data flywheel feedback. Include the policy version, observation data, predicted actions, and outcome label (success/failure).

Data Requirements by Algorithm

Task Complexity | ACT | Diffusion Policy | OpenVLA (fine-tune) | Octo (fine-tune) | Typical Success Rate
Simple pick-place | 50 | 100 | 100 | 50 | 80-95%
Bimanual coordination | 100 | 200 | 200 | 100 | 70-85%
Precise insertion | 200 | 300 | 300 | 150 | 65-80%
Multi-step assembly | 500 | 500 | 500 | 300 | 50-70%
Multi-task (language-cond.) | N/A | N/A | 500-2000 | 200-1000 | 40-70%

Hardware requirements for training: ACT and Diffusion Policy train on a single RTX 3060 or better (16 GB VRAM recommended). OpenVLA fine-tuning requires at least 1 A100 (40 GB) or 2 RTX 4090s. Octo fine-tunes on a single RTX 4090. All algorithms benefit from fast NVMe storage for loading image datasets.

5 Common Failure Modes and How to Fix Them

1. Distribution Shift (Compounding Error)

Symptom: The policy works for the first few seconds, then gradually drifts off-trajectory and fails.

Root cause: Small prediction errors compound over time, pushing the robot into states not represented in the training data. The policy has never seen these states and produces increasingly incorrect actions.

Fixes: Use action chunking (ACT predicts 100-step chunks, reducing the number of decision points where errors can compound). Add image augmentation (random crops, color jitter) during training to make the visual encoder robust to slight viewpoint changes. Collect recovery demonstrations: run the policy until it drifts, then teleoperate from the drifted state to task completion. These "DAgger-style" corrections are the highest-value data you can add.
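
A minimal, dependency-free sketch of the two augmentations mentioned: random crop plus brightness jitter (a simple form of color jitter) on a float image in [0, 1]. Crop size and jitter range are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=216, max_jitter=0.2):
    """img: (H, W, C) float array in [0, 1]. Returns a randomly cropped,
    brightness-jittered copy -- apply during training only."""
    h, w, _ = img.shape
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    patch = img[y:y + crop, x:x + crop].copy()
    patch *= 1.0 + rng.uniform(-max_jitter, max_jitter)   # brightness jitter
    return np.clip(patch, 0.0, 1.0)

out = augment(np.full((240, 320, 3), 0.5))
```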

2. Multi-Modality Collapse (Mode Averaging)

Symptom: The robot moves toward a position between two valid strategies, reaching neither. Or the robot freezes at decision points where multiple strategies are equally valid.

Root cause: The training data contains demonstrations with different strategies (approach from left vs. right), and the MSE loss averages them into a single incorrect middle trajectory.

Fixes: Switch to Diffusion Policy, which represents multimodal distributions natively. If staying with ACT, standardize the demonstration strategy — ensure all demonstrations follow the same approach. Remove demonstrations that use minority strategies (fewer than 20% of total). Alternatively, condition the policy on a strategy indicator variable during training.

3. Accumulating Position Error (Normalization Mismatch and Delta Drift)

Symptom: The robot produces correct-looking trajectories but accumulates position error over time, resulting in grasps that are 1–3 cm off-target.

Root cause: Action normalization mismatch between training and deployment (different mean/std values), or the policy predicts delta actions that accumulate floating-point error over long trajectories.

Fixes: Verify that the exact same normalization statistics (mean, std, min, max) are used during training and deployment. Prefer absolute joint position targets over delta actions when possible. For delta action policies, periodically re-anchor the predicted trajectory to the current observed state rather than purely integrating deltas.
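
The re-anchoring fix reduces to converting each new chunk of deltas into absolute targets relative to the measured state at chunk start, rather than to the previous commanded target. A sketch, with an illustrative 2-joint state:

```python
import numpy as np

def anchored_targets(measured_state, delta_chunk):
    """Turn a chunk of predicted delta actions into absolute position
    targets anchored at the robot's MEASURED state, so tracking error
    from earlier chunks does not leak into this one."""
    return measured_state + np.cumsum(delta_chunk, axis=0)

state = np.array([0.10, -0.25])            # measured joint positions
deltas = np.array([[0.01, 0.00],
                   [0.01, 0.00],
                   [0.01, 0.02]])          # policy's predicted deltas
targets = anchored_targets(state, deltas)  # targets[-1] -> ~[0.13, -0.23]
```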

4. Hardware Mismatch

Symptom: A policy that works well in the original data collection setup fails completely when moved to a different table, different lighting, or a different instance of the same robot model.

Root cause: Camera position shifted (even 5 mm degrades grasping), lighting conditions changed (affecting visual features), or robot joint calibration differs between units.

Fixes: Permanently mount cameras with rigid fixtures and verify extrinsics before every deployment session. Use a calibration target (ArUco board) to check camera pose. For multi-robot deployment, collect a small calibration dataset (20–30 demos) on each robot unit. Apply domain randomization during training: random brightness/contrast, random camera crop, and random background textures.

5. Insufficient Resets

Symptom: The policy appears to have a low success rate, but closer inspection reveals that many failures start from states that are out of the training distribution — objects in wrong positions, robot starting from a non-standard configuration.

Root cause: The evaluation protocol does not properly reset the scene to a state within the training distribution between episodes. Or the training data has a narrow start-state distribution that does not match deployment conditions.

Fixes: Define explicit reset criteria: object positions, robot home configuration, and scene arrangement. Verify reset quality before each evaluation episode. During data collection, systematically randomize start states across the intended deployment distribution. Track start-state coverage as a dataset metric.

Sim-to-Real Assessment

Training in simulation and deploying on real robots is appealing because simulation data is essentially free. But the reality gap makes pure sim-to-real unreliable for most manipulation tasks.

When Simulation Helps

  • Sim pre-training + real fine-tuning: Train on 10,000–100,000 simulated episodes, then fine-tune on 50–200 real episodes. The sim pre-training provides a strong initialization; real fine-tuning bridges the reality gap. This consistently outperforms training on real data alone when real data is scarce.
  • Domain randomization: Randomize visual properties (textures, lighting, camera position) and physics (friction, mass, joint damping) during simulation training. Forces the policy to be robust to variation, some of which overlaps with real-world variation. Most effective when combined with real fine-tuning.
  • Locomotion and whole-body control: Sim-to-real works much better for locomotion than for manipulation. Unitree's G1 and Go2 controllers are trained almost entirely in simulation using Isaac Lab.
  • Policy validation and debugging: Run your trained policy in simulation before deploying on real hardware. Simulation catches gross errors (wrong action space, normalization bugs, control frequency mismatches) without risking hardware damage.

When Simulation Does Not Help

  • Pure sim-to-real for contact-rich manipulation: Simulation cannot accurately model friction, deformation, and contact dynamics of real objects. Policies trained purely in sim fail on grasping soft objects, thin objects, or precision insertion tasks.
  • Simulated camera images as training data: Despite advances in photorealistic rendering, policies trained on simulated images alone show 30–50% success rate degradation on real cameras. Neural rendering (NeRF-based) is improving but not yet production-ready for policy training.
  • Tasks involving deformable objects: Cloth, rope, and food items are extremely difficult to simulate accurately. Sim-trained policies for deformable manipulation rarely transfer to real hardware without extensive real data.

Getting Started Checklist

Follow these 10 steps to go from zero to a deployed imitation learning policy:

  1. Choose your robot hardware. For a first project: ViperX-300 S2, Koch v1.1, or OpenArm. Budget: $2,000–$8,000 for the arm. You need position control at 50 Hz minimum.
  2. Set up teleoperation. Leader-follower arm for highest data quality, or VR controller (Meta Quest 3) for lower cost. See our teleoperation guide.
  3. Mount cameras. Minimum: 1 wrist camera (Intel RealSense D405) + 1 overhead camera (RealSense D435). Use rigid mounts — cameras must not move between collection and deployment.
  4. Install recording software. Use LeRobot's data recording tools or the ALOHA recording stack. Verify HDF5 output includes images, joint positions, and actions at correct frequencies.
  5. Define your first task. Start simple: pick a single object from a fixed location and place it in a target zone. Master this before adding complexity.
  6. Collect 50 demonstrations. Consistent operator, consistent strategy. Pause for 10 minutes after every 20 episodes to check data quality. Reject and re-record bad episodes immediately.
  7. Train your first ACT policy. Use LeRobot's training script with default hyperparameters. Training should complete in 2–4 hours on a single GPU.
  8. Run 20 evaluation trials. Reset the scene identically before each trial. Record success/failure and save video of each trial for failure analysis.
  9. Diagnose failures and collect targeted data. Identify the most common failure mode. Collect 20–30 demonstrations specifically targeting that failure. Retrain.
  10. Iterate until >80% success rate. Most teams reach 80%+ within 2–3 collection-training cycles for simple tasks. Then begin expanding to data flywheel operations.

Frequently Asked Questions

How many demonstrations do I need?

For ACT on a simple pick-and-place task: 50 demonstrations is often enough for 80%+ success rate. For Diffusion Policy: start with 100–200. For VLA fine-tuning: 100–500 depending on task complexity. In all cases, data quality matters more than quantity — 50 clean demonstrations outperform 500 noisy ones. See the data requirements table above for detailed estimates by task complexity.

ACT vs. Diffusion Policy — which should I choose?

If your task has a single clear strategy and you want the fastest path to deployment, use ACT. If your data contains multiple valid approaches to the same task (multiple operators, ambiguous grasp strategies), use Diffusion Policy. If you need language-conditioned multi-task execution, use a VLA. When in doubt, start with ACT — it trains faster and lets you validate your data collection pipeline before investing in more complex models.

Can I use simulation data?

Simulation data alone is unreliable for contact-rich manipulation. The most effective approach is sim pre-training + real fine-tuning: train on 10,000+ simulated episodes, then fine-tune on 50–200 real episodes. For locomotion tasks, pure sim-to-real with domain randomization works well. For manipulation, always plan to collect some real data.

What robot hardware do I need?

At minimum: a robot arm with position control at 50 Hz (ViperX-300, Koch v1.1, OpenArm, or similar), a teleoperation interface (leader arm or VR controller), 2–3 cameras (wrist + overhead), and a workstation with an NVIDIA GPU (RTX 3060 or better for ACT, RTX 4090 or A100 for Diffusion Policy and VLAs). Total budget for a starter setup: $5,000–$15,000.

What is the difference between imitation learning and reinforcement learning?

Imitation learning trains from human demonstrations — you show the robot what to do. Reinforcement learning trains from trial and error with a reward signal — the robot explores and learns which actions maximize reward. For manipulation, imitation learning is far more sample-efficient (50–500 demos vs. millions of RL steps) and does not require reward engineering. RL dominates for locomotion and tasks with clear reward signals.

How long does training take?

ACT on 200 episodes: 2–8 hours on a single RTX 4090. Diffusion Policy on 500 episodes: 4–16 hours. VLA fine-tuning on 500 episodes: 4–24 hours on 1–4 A100 GPUs. The real bottleneck is usually data collection and evaluation, not training compute.

What data format should I use?

HDF5 is the standard format for robot demonstration data. Store joint positions (qpos), velocities (qvel), camera images, and actions as separate datasets within each episode file. For training with LeRobot, convert to Parquet format. See our data format guide for detailed specifications.

Can imitation learning work for mobile manipulation?

Yes, but it requires significantly more data (500–2,000+ demonstrations) because the state space is larger. VLA models like OpenVLA and pi0 are better suited for mobile manipulation because their pre-training covers diverse embodiments. For tabletop tasks, ACT and Diffusion Policy remain the most practical choices.