FAST (Fast Action Tokenizer)

Physical Intelligence's DCT + BPE Based Robot Action Tokenizer - 10x Compression, 5x Faster VLA Training

Author’s Note

  • Discrete action tokens + compression potential. Shows that even with discrete action tokens, good compression can substantially shorten pretraining time.
  • Maximizing LLM capabilities. The autoregressive structure makes better use of the language understanding the LLM gained in pretraining.
  • Best suited to the research stage. Slow inference is its main drawback, making it more suitable for research/training than for real-world deployment.

Key Significance

  • Breakthrough DCT + BPE Compression: Combines DCT (used in JPEG/MP3) with BPE (LLM tokenizers) to achieve approximately 10x compression
  • 5x Faster VLA Training: Dramatically reduced training time compared to diffusion-based models
  • High-Frequency Dexterous Task Support: Enables learning of high-frequency precise manipulation tasks impossible with standard binning
  • Diffusion-Level Dexterity: Achieves similar levels of dexterity as flow matching/diffusion approaches
  • Improved Language Instruction Following: Autoregressive structure better transfers internet-scale pretraining language understanding
  • Universal Tokenizer FAST+: Trained on 1 million real robot trajectories, immediately applicable to diverse action spaces/frequencies

Overview

FAST (Frequency-space Action Sequence Tokenization) is a robot action tokenizer published by Physical Intelligence in January 2025. It overcomes the limitations of simple per-dimension binning used by existing VLA models, enabling effective autoregressive model training even on high-frequency dexterous manipulation tasks.

| Item | Details |
| --- | --- |
| Published | January 16, 2025 |
| Company | Physical Intelligence |
| Paper | arXiv:2501.09747 |
| Blog | pi.website/research/fast |
| Model | HuggingFace: physical-intelligence/fast |

Why FAST is Needed

Problems with Existing Tokenization

Existing VLA models (OpenVLA, RT-2, etc.) tokenized robot actions using per-dimension, per-timestep binning:

| Problem | Description |
| --- | --- |
| Token explosion | Enormous token sequences at high control frequencies (50 Hz+) |
| Dexterous failure | Cannot learn high-frequency tasks like precise finger manipulation |
| Inefficient training | Long sequences inflate training time |
| Weakened language connection | Gap between action tokens and language tokens |
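The token-explosion problem is easy to quantify with back-of-envelope arithmetic. The 50 Hz / 7-DoF / 1-second-chunk numbers below are illustrative assumptions, not figures from the paper:

```python
# Naive per-dimension binning: one token per action dimension per timestep.
control_hz = 50        # illustrative high-frequency controller
chunk_seconds = 1.0    # length of one action chunk
dof = 7                # e.g. a single-arm manipulator

timesteps = int(control_hz * chunk_seconds)
binning_tokens = timesteps * dof          # 50 * 7 = 350 tokens per chunk

# FAST reports roughly 30-60 tokens for a chunk (~10x compression).
fast_tokens = 35
print(binning_tokens, binning_tokens / fast_tokens)  # 350 10.0
```

At 350 tokens per second of motion, a single demonstration of a few minutes already dominates an LLM's context window, which is why compression before tokenization matters.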

FAST’s Solution

Compression-based Approach: Compress action sequences first, then tokenize


Technical Architecture

FAST Tokenizer Pipeline

FAST Tokenizer: DCT → Quantize → Flatten → BPE Compression Process

5-Stage Compression Pipeline

| Stage | Name | Description |
| --- | --- | --- |
| 1 | Normalized action chunk | Normalize the raw action sequence |
| 2 | DCT (Discrete Cosine Transform) | Time domain → frequency domain transform (same principle as JPEG/MP3) |
| 3 | Quantize | Round frequency components to discrete values, yielding a sparse frequency matrix |
| 4 | Flatten | Flatten to a 1D array, low-frequency components first |
| 5 | BPE (Byte Pair Encoding) | Merge frequent patterns into new tokens for final compression |

Note: This is a lossy compression approach where information loss occurs at the Quantization stage. Similar to JPEG image compression, it achieves higher compression ratios by discarding some high-frequency components.
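A minimal sketch of stages 1-4 using SciPy's DCT. The quantization `scale` is a made-up hyperparameter for illustration (the real tokenizer fits quantization to the dataset), and stage 5's BPE is omitted:

```python
import numpy as np
from scipy.fft import dct, idct

def fast_precompress(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """chunk: (timesteps, action_dims), already normalized to roughly [-1, 1]."""
    # Stage 2: DCT along the time axis -> frequency-domain coefficients.
    coeffs = dct(chunk, axis=0, norm="ortho")
    # Stage 3: quantize; most high-frequency coefficients round to zero (lossy).
    q = np.round(coeffs * scale).astype(np.int64)
    # Stage 4: flatten to 1D, low-frequency rows first.
    return q.flatten()

def fast_decompress(flat: np.ndarray, timesteps: int, dims: int,
                    scale: float = 10.0) -> np.ndarray:
    coeffs = flat.reshape(timesteps, dims) / scale
    return idct(coeffs, axis=0, norm="ortho")

# Smooth 50-step, 7-dim action chunk: reconstruction stays close to the input.
t = np.linspace(0, 1, 50)[:, None]
chunk = np.sin(2 * np.pi * t * np.arange(1, 8))
flat = fast_precompress(chunk)
recon = fast_decompress(flat, 50, 7)
print(np.abs(chunk - recon).max())  # small quantization error
```

Because robot trajectories are smooth, the quantized coefficient matrix is dominated by zeros, which is exactly the repetitive structure BPE then compresses.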

Compression Results

| Metric | Value |
| --- | --- |
| Compression ratio | ~10x |
| Tokens per chunk | 30-60 |
| Implementation complexity | ~3 lines of code |
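Stage 5's BPE works the same way as LLM tokenizer training: repeatedly merge the most frequent adjacent token pair into a new token. A toy illustration of one merge step (not the actual FAST vocabulary or merge rules):

```python
from collections import Counter

def bpe_merge_step(seq: list[int], next_id: int) -> tuple[list[int], tuple]:
    """Merge the most frequent adjacent pair in seq into token next_id."""
    pairs = Counter(zip(seq, seq[1:]))
    best, _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
            merged.append(next_id)   # replace the pair with the new token
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged, best

# Quantized DCT coefficients are highly repetitive (runs of zeros),
# so a few merges shrink the sequence quickly.
seq = [5, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
seq, pair = bpe_merge_step(seq, next_id=256)   # merges the (0, 0) pair
print(len(seq), pair)  # 9 (0, 0)
```

Training repeats this step until the vocabulary reaches a target size; encoding then applies the learned merges in order.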

FAST+ Universal Tokenizer

Training Data

| Item | Details |
| --- | --- |
| Training data | 1 million real robot trajectories |
| Data sources | Various robot platforms |
| Action spaces | Various DoF and control frequencies |

Features

  • Zero-shot Application: Immediately usable on new robots
  • Universality: Supports various action spaces and control frequencies
  • Pre-trained: No separate tokenizer training required

Performance Comparison

vs Diffusion/Flow Matching Based VLA

| Item | FAST (Autoregressive) | Diffusion/Flow Matching |
| --- | --- | --- |
| Training speed | 5x faster | Baseline |
| Dexterity | Similar level | Similar level |
| Inference speed | Slower (sequential decoding) | Faster |
| Language understanding | Better | Baseline |

Validated Tasks

Complex manipulation tasks successfully performed by FAST-trained policies:

| Task | Description |
| --- | --- |
| Laundry folding | Folding clothes and towels |
| Table bussing | Clearing and organizing tables |
| Grocery bagging | Packing groceries into bags |

DROID Dataset Results

  • First zero-shot generalization: the first generalist policy trained on the DROID dataset to generalize zero-shot to new environments
  • Multi-environment deployment: validated at UC Berkeley, Stanford, and the University of Washington

Pi0-FAST Integration

Pi0-FAST is a variant of Pi0 that generates actions autoregressively over FAST tokens instead of via flow matching:

| Feature | Pi0 (Flow Matching) | Pi0-FAST (Autoregressive) |
| --- | --- | --- |
| Action generation | Flow matching | Autoregressive over FAST tokens |
| Training speed | Baseline | 5x faster |
| Inference cost | Baseline | 4-5x higher |
| Language understanding | Good | Better |

Limitations

Inference Speed

| Issue | Description |
| --- | --- |
| Autoregressive decoding | Tokens must be generated sequentially |
| Slower than Pi0 | Slower than flow matching's parallel decoding |
| Increased inference cost | Must be budgeted for in real-time control |
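The real-time concern can be made concrete with simple arithmetic. All latency numbers below are hypothetical, chosen only to illustrate how sequential decoding eats into the control budget:

```python
# Autoregressive decoding produces tokens one at a time, so chunk latency
# scales with token count. Hypothetical per-token latency for a large VLA:
per_token_ms = 20.0
tokens_per_chunk = 45        # middle of FAST's reported 30-60 range
chunk_latency_ms = per_token_ms * tokens_per_chunk   # 900.0 ms per chunk

# A 1-second action chunk must be decoded before the previous chunk runs
# out; 900 ms of 1000 ms leaves little margin for real-time control.
chunk_duration_ms = 1000.0
print(chunk_latency_ms, chunk_latency_ms < chunk_duration_ms)
```

A diffusion or flow-matching head, by contrast, produces the whole chunk in a small fixed number of denoising passes, which is why Pi0 retains the inference-speed advantage.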

Suitable Use Cases

  • Research environments where training time is critical
  • Tasks where language instruction understanding is important
  • Offline batch training

Technical Details

Authors

| Name | Affiliation |
| --- | --- |
| Karl Pertsch | Physical Intelligence |
| Kyle Stachowicz | Physical Intelligence |
| Brian Ichter | Physical Intelligence |
| Danny Driess | Physical Intelligence |
| Suraj Nair | Physical Intelligence |
| Quan Vuong | Physical Intelligence |
| Oier Mees | Physical Intelligence |
| Chelsea Finn | Physical Intelligence |
| Sergey Levine | Physical Intelligence |

Scaling

| Item | Value |
| --- | --- |
| Training data | 10,000+ hours |
| Robot trajectories | 1M+ |
| Supported robots | Various platforms |
