ACT (Action Chunking with Transformers)

Stanford's Action Chunking-based Imitation Learning Policy


Key Significance

  • Action Chunking Concept: Inspired by psychology, groups sequences of actions into single units (chunks) for execution, mitigating compounding error
  • Extreme Data Efficiency: Achieves 80-90% success rates with just 10 minutes of demonstration data - a breakthrough in precision manipulation
  • Low-Cost ALOHA Hardware: Enables a bimanual dexterous manipulation system for ~$20K, with a modular design for easy maintenance
  • New Standard for Bimanual Dexterous Manipulation: Performs previously difficult precision tasks such as zip tie insertion and battery placement
  • LeRobot Default Recommended Model: Adopted as the default recommended model in HuggingFace LeRobot
  • Fast Training with Low Compute: Trainable on standard GPUs with short training times
  • CVAE-Based Architecture: A style variable (z) captures diverse demonstration styles; the prior mean (z = 0) is used at inference

ACT Demo: Battery Slot Insertion - Precise Bimanual Manipulation


Overview

ACT (Action Chunking with Transformers) is an imitation learning algorithm developed at Stanford. Released alongside the low-cost ALOHA hardware system, it demonstrated that bimanual dexterous manipulation can be learned from just 10 minutes of demonstration data.

Item          Details
Published     April 2023 (RSS 2023)
Authors       Tony Zhao, Vikash Kumar, Sergey Levine, Chelsea Finn
Affiliation   Stanford University
Paper         arXiv:2304.13705
Project       tonyzhaozh.github.io/aloha

Key Ideas

Action Chunking

A concept inspired by psychology: a sequence of individual actions is grouped into a single unit (a chunk) and executed as a whole.

Traditional Behavior Cloning:

Observation → Policy → next single action

ACT’s Action Chunking:

Observation → Policy → next k actions (e.g., k = 90 timesteps)

Advantages:

  • Reduces the effective task horizon by a factor of k
  • Mitigates compounding error
  • Generates smoother motions (see the sketch below)
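
To make the contrast concrete, here is a minimal sketch of the two execution loops. `policy`, `env`, and their Gym-like interface are hypothetical stand-ins, not code from the paper:

```python
K = 90  # chunk size: number of future actions predicted per policy query

def rollout_bc(env, policy, horizon):
    """Traditional behavior cloning: one policy query per executed action."""
    obs = env.reset()
    for _ in range(horizon):
        action = policy(obs)       # predict a single next action
        obs = env.step(action)     # small errors can compound at every step

def rollout_chunked(env, policy, horizon):
    """ACT-style action chunking: one policy query per K executed actions."""
    obs = env.reset()
    for _ in range(horizon // K):  # only horizon / K decisions in total
        chunk = policy(obs)        # predict the next K actions at once
        for action in chunk:       # execute the whole chunk open-loop
            obs = env.step(action)
```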

Temporal Ensembling

Queries the policy at every timestep rather than every k steps, and averages the overlapping action chunks with an exponential weighting scheme, yielding smoother and more reactive execution.
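
A minimal NumPy sketch of that weighted average, assuming the actions predicted for the current timestep by past policy queries have already been collected (the function name and the default m are illustrative):

```python
import numpy as np

def temporal_ensemble(predictions, m=0.01):
    """Combine all overlapping predictions for the current timestep.

    `predictions` holds the actions that past policy queries predicted
    for the current timestep, oldest first. The ACT paper weights them
    as w_i = exp(-m * i), where i = 0 is the oldest prediction and a
    smaller m incorporates new observations faster.
    """
    preds = np.stack(predictions)                 # (num_overlaps, action_dim)
    weights = np.exp(-m * np.arange(len(preds)))  # exponential weights
    weights /= weights.sum()                      # normalize to sum to 1
    return weights @ preds                        # weighted-average action
```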


Architecture

ACT is trained as the decoder of a conditional variational autoencoder (CVAE); a simplified sketch of this data flow appears after the input/output summary below.

ACT Architecture: CVAE-based, encodes style variable z during training, uses z=0 at inference

Inputs:

  • 4 RGB camera images (480x640)
  • Joint positions

Outputs:

  • 90-timestep action sequence
  • 50 Hz control frequency
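
The following is a heavily simplified sketch of the CVAE data flow, with small MLP stand-ins (`nn.LazyLinear`) in place of ACT's transformer backbones; the class name and layer sizes are illustrative, while the reparameterization, the z = 0 inference path, and the reconstruction-plus-KL objective follow the paper:

```python
import torch
import torch.nn as nn

class ACTSketch(nn.Module):
    """Schematic CVAE structure of ACT (transformers replaced by MLPs)."""

    def __init__(self, action_dim=14, z_dim=32, k=90):
        super().__init__()
        self.k, self.z_dim, self.action_dim = k, z_dim, action_dim
        # Stand-in for the CVAE encoder (a transformer in the real model):
        # maps the demonstrated action chunk to the style variable's stats.
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(2 * z_dim))
        # Stand-in for the CVAE decoder (image features + transformer in the
        # real model): maps observation features and z to a k-step chunk.
        self.decoder = nn.LazyLinear(k * action_dim)

    def forward(self, obs_feats, actions=None):
        if actions is not None:
            # Training: encode the demo chunk into z via reparameterization.
            mu, logvar = self.encoder(actions).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        else:
            # Inference: use the prior mean, z = 0 (the encoder is skipped).
            mu = logvar = None
            z = obs_feats.new_zeros(obs_feats.shape[0], self.z_dim)
        chunk = self.decoder(torch.cat([obs_feats, z], dim=-1))
        # Training loss (not shown): L1 reconstruction of the chunk + KL(mu, logvar).
        return chunk.view(-1, self.k, self.action_dim), mu, logvar
```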

ALOHA Hardware

A low-cost, open-source bimanual teleoperation and manipulation system released alongside ACT.

Item          Details
Total Cost    ~$20,000
Robot Arms    ViperX 6-DoF x 2 (each ~$5,600)
Payload       750 g
Workspace     1.5 m span
Accuracy      5-8 mm
Features      Modular, Dynamixel motors (easy replacement)

Performance

Task success rates when trained with 50 demonstrations:

Task      Success Rate
Task 1    96%
Task 2    84%
Task 3    64%
Task 4    92%

Demonstration Data Efficiency:

  • 10 minutes of demonstration data achieves 80-90% success rate
  • Performs precision tasks like zip tie insertion and battery placement

Demonstrated Tasks

  • Opening a transparent sauce cup
  • Inserting a battery into its slot
  • Juggling a ping pong ball (dynamic task)
  • Assembling a chain (high-contact task)
  • Inserting a zip tie (precision task)

ACT Demo: Transparent Sauce Cup Manipulation - Reactive Bimanual Coordination


Impact & Adoption

ACT is widely adopted for the following reasons:

  • Fast, Low-Compute Training: trains quickly on a single standard GPU
  • Strong Performance: high success rates in precision manipulation
  • LeRobot Integration: the default recommended model in HuggingFace LeRobot (see the usage sketch below)
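
Since LeRobot is mentioned, here is roughly what loading a pretrained ACT policy there looks like; the import path, checkpoint ID, and batch keys are assumptions that can differ between LeRobot versions, so treat this as a sketch rather than version-exact usage:

```python
import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy  # path may vary by version

# Checkpoint ID is illustrative; check the LeRobot Hub for current ACT checkpoints.
policy = ACTPolicy.from_pretrained("lerobot/act_aloha_sim_transfer_cube_human")
policy.eval()

# Dummy inputs following LeRobot's observation naming convention (assumed here).
batch = {
    "observation.images.top": torch.zeros(1, 3, 480, 640),  # one RGB frame
    "observation.state": torch.zeros(1, 14),                # 14 joint positions
}
with torch.no_grad():
    action = policy.select_action(batch)  # next action, drawn from the current chunk
```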

Follow-up Research

Model          Description
ALOHA 2        Improved, more robust version of the ALOHA hardware
Mobile ALOHA   ALOHA mounted on a mobile base for whole-body bimanual tasks
Bi-ACT         Extension of ACT based on bilateral control
