ACT (Action Chunking with Transformers)

Stanford's Action Chunking-based Imitation Learning Policy

Author’s Note

  • A savior of countless demos: anyone can collect a few dozen teleoperated demonstrations and train ACT into a working demo.
  • At the numerous exhibitions and conference demo booths I visited in 2025, most of the robot demos were built with ACT.

ACT Demo: Battery Slot Insertion - Precise Bimanual Manipulation

Key Significance

  • Action Chunking Concept: Inspired by psychology, groups sequences of actions into single units (chunks) executed together - mitigates compounding error
  • Extreme Data Efficiency: Reports ~80–90% success on some tasks with ~10 minutes of demonstrations (task/data-regime dependent)
  • Low-Cost ALOHA Hardware: Enables a bimanual dexterous manipulation system for ~$20K, with a modular design for easy maintenance
  • New Standard for Bimanual Dexterous Manipulation: Performs previously difficult precision tasks such as zip tie insertion and battery placement
  • LeRobot Default Recommended Model: Adopted as the default recommended model in HuggingFace LeRobot
  • Fast Training with Low Compute: Trainable on standard GPUs with short training time
  • CVAE-Based Architecture: Style variable (z) captures diverse demonstration styles, uses prior mean at inference

Overview

ACT (Action Chunking with Transformers) is an imitation learning algorithm developed at Stanford. Released alongside the low-cost hardware system ALOHA, it reports that some bimanual dexterous manipulation tasks can be learned with ~10 minutes of demonstration data (task-dependent).

Item          Details
Published     April 2023 (RSS 2023)
Authors       Tony Zhao, Vikash Kumar, Sergey Levine, Chelsea Finn
Affiliation   Stanford University
Paper         arXiv:2304.13705
Project       tonyzhaozh.github.io/aloha

Key Ideas

Action Chunking

A concept inspired by psychology: a sequence of individual actions is grouped into a single unit (chunk) and executed together.

Traditional Behavior Cloning:

Observation → Policy → next action (1 timestep)

ACT’s Action Chunking:

Observation → Policy → next k actions (e.g., 90 timesteps)

Advantages:

  • Reduces the effective task horizon by a factor of k
  • Mitigates compounding error
  • Generates smoother motions
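The query-count arithmetic behind that horizon reduction can be sketched with toy stand-ins; `ToyEnv` and `toy_policy` below are hypothetical placeholders, not ACT components:

```python
# Hypothetical sketch: how chunking cuts the number of policy queries.
# ToyEnv and toy_policy are illustrative stand-ins, not ACT code.

class ToyEnv:
    def reset(self):
        return 0.0

    def step(self, action):
        return action  # the action becomes the next observation

def toy_policy(obs, k):
    """Predict the next k actions from the current observation."""
    return [obs + 1.0] * k

def rollout_single_step(policy, env, horizon):
    """Traditional behavior cloning: one policy query per timestep."""
    obs = env.reset()
    queries = 0
    for _ in range(horizon):
        obs = env.step(policy(obs, 1)[0])
        queries += 1
    return queries

def rollout_chunked(policy, env, horizon, k):
    """ACT-style chunking: one policy query per k-step chunk."""
    obs = env.reset()
    queries, t = 0, 0
    while t < horizon:
        chunk = policy(obs, k)            # predict k actions at once
        for action in chunk[: horizon - t]:
            obs = env.step(action)        # execute the chunk open-loop
        t += k
        queries += 1
    return queries
```

For a 90-step episode with k = 30, the chunked rollout queries the policy 3 times instead of 90 — the "horizon reduced by a factor of k" effect.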

Temporal Ensembling

Instead of waiting for a chunk to finish, the policy is queried at every timestep and the overlapping action chunks are averaged (with exponential weighting) for smoother, more reactive execution.
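A minimal sketch of the averaging step, assuming the paper's exponential weighting scheme w_i = exp(−m·i) over all predictions made for the same timestep (oldest first); the function name, default m, and array layout are illustrative:

```python
import numpy as np

def temporal_ensemble(predictions, m=0.1):
    """Blend the overlapping predictions made for one timestep.

    `predictions` holds the action each still-active chunk proposed
    for the current timestep, oldest first. Weights follow the
    exponential scheme w_i = exp(-m * i), normalized to sum to 1;
    the default m here is illustrative, not the paper's exact value.
    """
    preds = np.asarray(predictions, dtype=float)  # (n_chunks, action_dim)
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()
    return w @ preds                              # weighted-average action
```

With m = 0 this reduces to a plain mean; larger m favors the older predictions.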


Architecture

ACT is trained as a Conditional VAE (CVAE) decoder.

ACT Architecture

ACT Architecture: CVAE-based, encodes style variable z during training, uses z=0 at inference

Inputs:

  • 4 RGB camera images (480x640)
  • Joint positions

Outputs:

  • 90-timestep action sequence (~1.8 s of motion at 50 Hz)
  • 50 Hz control frequency
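The CVAE objective behind this architecture combines an L1 reconstruction loss over the predicted action chunk with a KL term pulling the style-variable posterior toward the standard-normal prior. A numpy sketch, with illustrative names; the KL weight β = 10 follows the value reported in the paper:

```python
import numpy as np

def act_cvae_loss(pred_actions, target_actions, mu, logvar, beta=10.0):
    """Sketch of the ACT CVAE objective (names are illustrative).

    L1 reconstruction over the predicted chunk plus the KL divergence
    between the posterior N(mu, diag(exp(logvar))) over the style
    variable z and the standard-normal prior, weighted by beta.
    """
    recon = np.abs(pred_actions - target_actions).mean()
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=-1).mean()
    return recon + beta * kl
```

At inference the encoder is dropped entirely and z is fixed to the prior mean (0), so only the decoder runs.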

ALOHA Hardware

Low-cost bimanual manipulation system released alongside ACT.

Item         Details
Total Cost   ~$20,000
Robot Arms   ViperX 6-DoF × 2 (each ~$5,600)
Payload      750 g
Workspace    1.5 m span
Accuracy     5–8 mm
Features     Modular design, Dynamixel motors (easy replacement)

Performance

Task success rates when trained with 50 demonstrations:

Task     Success Rate
Task 1   96%
Task 2   84%
Task 3   64%
Task 4   92%

Demonstration Data Efficiency:

  • Reports ~80–90% success on some tasks with ~10 minutes of demonstrations (task-dependent)
  • Performs precision tasks like zip tie insertion and battery placement

Demonstrated Tasks

  • Opening transparent sauce cup
  • Inserting battery into slot
  • Ping pong ball juggling (dynamic task)
  • Chain assembly (high-contact task)
  • Zip tie insertion (precision task)

ACT Demo: Transparent Sauce Cup Manipulation - Reactive Bimanual Coordination


Impact & Adoption

ACT is widely adopted for the following reasons:

  • Fast Training: Short training time
  • Low Compute Requirements: Trainable on standard GPUs
  • Strong Performance: High success rates in precision manipulation
  • LeRobot Integration: Default recommended model in HuggingFace LeRobot

Follow-up Research

Model          Description
Mobile ALOHA   Extends ALOHA with a mobile base for whole-body bimanual tasks
ALOHA 2        Improved, more robust hardware iteration of ALOHA
Bi-ACT         Extension of ACT based on bilateral control
