Pi0 (pi-zero)

Physical Intelligence's First Generalist Policy - Flow Matching Based VLA


Key Significance

  • First Successful Application of Flow Matching to Robotics: Demonstrates flow matching as a practical alternative to diffusion for action generation
  • VLM Knowledge Transfer to Robots: Leverages the PaliGemma (3B) VLM's internet-scale knowledge for dexterous manipulation
  • 50Hz High-Frequency Control: Generates 50 motor commands per second via action chunking
  • 8 Robot Platforms: Trained across diverse embodiments, including single-arm, bimanual, and mobile manipulators
  • Outperforms OpenVLA/Octo: Large performance advantage over existing open-source VLAs on complex dexterous tasks
  • Open-Source Release: Full release of weights, training code, and JAX/PyTorch implementations via openpi

Pi0 Overview

[Figure] Pi0 Architecture: PaliGemma VLM + Flow Matching Action Expert

Overview

Pi0 (pi-zero) is the first general-purpose robot foundation model from Physical Intelligence, announced in October 2024 after eight months of development. The company, founded by researchers who led Google DeepMind's RT series, introduced with Pi0 a new VLA paradigm based on flow matching.

| Item | Details |
| --- | --- |
| Published | October 31, 2024 |
| Company | Physical Intelligence |
| Paper | arXiv:2410.24164 |
| Blog | pi.website/blog/pi0 |
| GitHub | Physical-Intelligence/openpi |

Architecture

Pi0 is a VLM + Flow Matching Action Expert hybrid architecture.

+-------------------------------------------------------------+
|                      Pi0 Architecture                        |
+-------------------------------------------------------------+
|                                                              |
|   +----------------------------------------------------+    |
|   |              PaliGemma VLM (3B)                    |    |
|   |         Internet-scale Semantic Knowledge          |    |
|   |    * Image understanding    * Language instruction |    |
|   +------------------------+---------------------------+    |
|                            |                                 |
|                            v                                 |
|   +----------------------------------------------------+    |
|   |            Action Expert (+300M)                   |    |
|   |    * Proprioceptive states processing              |    |
|   |    * Bidirectional attention between action tokens |    |
|   |    * Separate Transformer weights                  |    |
|   +------------------------+---------------------------+    |
|                            |                                 |
|                            v                                 |
|   +----------------------------------------------------+    |
|   |              Flow Matching                         |    |
|   |    * Continuous action distribution generation     |    |
|   |    * Multimodal action handling                    |    |
|   |    * 50Hz high-frequency control                   |    |
|   +----------------------------------------------------+    |
|                                                              |
+-------------------------------------------------------------+

Model Specifications

| Component | Spec |
| --- | --- |
| VLM Backbone | PaliGemma (3B) |
| Action Expert | +300M parameters |
| Total Parameters | ~3.3B |
| Control Frequency | 50Hz |
| Action Horizon | 50 steps (1 second) |
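The 50Hz figure and the 50-step horizon fit together in a chunked control loop: the policy is queried once per second, returns 50 actions, and those actions are streamed to the robot between queries. The sketch below is illustrative, not openpi code; `DummyPolicy`, `run_episode`, and the 7-DoF action shape are assumptions.

```python
# Illustrative chunked control loop (not openpi code). One inference
# call yields a 50-step action chunk, i.e. 1 second of motion at 50 Hz.

CONTROL_HZ = 50
CHUNK_LEN = 50  # 50 steps at 50 Hz = 1 second per chunk


class DummyPolicy:
    """Stand-in for a Pi0 policy; returns a chunk of 7-DoF actions."""

    def infer_chunk(self, observation):
        # A real policy would run flow-matching denoising here.
        return [[0.0] * 7 for _ in range(CHUNK_LEN)]


def run_episode(policy, get_observation, send_action, n_chunks=3):
    """Query the policy once per chunk and stream its actions out."""
    executed = 0
    for _ in range(n_chunks):
        chunk = policy.infer_chunk(get_observation())
        for action in chunk:
            send_action(action)  # a real loop paces this at 1 / CONTROL_HZ s
            executed += 1
    return executed


log = []
n = run_episode(DummyPolicy(), lambda: None, log.append)
print(n)  # 3 chunks x 50 steps = 150 actions sent
```

Querying once per chunk instead of once per step is what keeps a ~3.3B-parameter model compatible with high-frequency control.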

What is Flow Matching?

Flow matching is an alternative to diffusion for modeling continuous distributions:

| Feature | Description |
| --- | --- |
| Continuous distribution | Handles complex multimodal action distributions |
| Efficiency | Faster sampling than diffusion |
| Transformer integration | Combines naturally with a VLM backbone |
| High-frequency control | Well suited to generating action chunks at 50Hz |
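The mechanics can be sketched with a toy example. Assuming the common linear-interpolant formulation (x_tau = (1 - tau) * noise + tau * action, with target velocity action - noise; the Pi0 paper's exact time convention may differ), Euler-integrating the velocity field from noise to data produces an action chunk. Here the exact per-sample velocity field stands in for a trained network:

```python
# Minimal flow-matching sketch (numpy only). A trained model would
# predict the velocity; here we use the exact conditional velocity for
# a single target action chunk, so integration should land on it.
import numpy as np


def velocity(x, tau, a):
    """Exact conditional velocity field pointing from x toward `a`."""
    # On the path x_tau = (1 - tau) * eps + tau * a, the target velocity
    # a - eps can be rewritten as (a - x_tau) / (1 - tau).
    return (a - x) / (1.0 - tau)


def sample_action_chunk(a, n_steps=10, seed=0):
    """Euler-integrate the flow from Gaussian noise (tau=0) to data (tau=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(a.shape)  # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt, a)
    return x


target = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((50, 7))  # 50x7 chunk
sample = sample_action_chunk(target)
print(float(np.abs(sample - target).max()))  # near zero: the flow hits the target
```

The small, fixed number of integration steps is the efficiency advantage the table above refers to: sampling cost is a handful of forward passes rather than a long diffusion denoising schedule.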

Action Expert

A separate module that handles robot control alongside the VLM:

  • 300M additional parameters: Separate Transformer weights
  • Proprioceptive processing: Robot state information encoding
  • Bidirectional attention: Ensures consistency between action tokens
  • Continuous output: Generates continuous commands via flow matching
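The bidirectional-attention bullet can be made concrete with a blockwise mask. This is a simplified sketch (numpy), not the exact Pi0 layout: tokens attend bidirectionally within their own block and causally across blocks, so action tokens see the whole prefix and each other.

```python
# Blockwise attention mask sketch. The prefix (image + language tokens)
# attends within itself; action tokens attend to the prefix and
# bidirectionally among themselves. The exact block layout in Pi0
# (e.g. where the state token sits) is simplified here.
import numpy as np


def block_attention_mask(block_sizes):
    """mask[i, j] = True iff token i may attend to token j.

    Tokens attend bidirectionally within their own block and to every
    token in earlier blocks (blockwise-causal across blocks).
    """
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]


# prefix (image + language) = 4 tokens, state = 1, action chunk = 3
mask = block_attention_mask([4, 1, 3])
print(mask.astype(int))
```

Bidirectional attention within the action block is what lets the 50 actions of a chunk stay mutually consistent, rather than each step being predicted blind to its successors.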

Training Data

Pi Dataset

Dexterous manipulation data directly collected by Physical Intelligence:

| Item | Details |
| --- | --- |
| Total Data | 10,000+ hours |
| Robot Platforms | 8 |
| Tasks | 68 |

Supported Robot Platforms

| Robot | Type |
| --- | --- |
| UR5e | Single arm |
| Bimanual UR5e | Bimanual |
| Franka | Single arm |
| Bimanual Trossen | Bimanual |
| Bimanual ARX | Bimanual |
| Mobile Trossen | Mobile manipulator |
| Mobile Fibocom | Mobile manipulator |

Task Examples

  • Laundry folding
  • Coffee preparation
  • Grocery bagging
  • Table bussing
  • Cable routing
  • Box assembly
  • Power plug insertion

External Data

  • Open X-Embodiment (OXE): Includes various robot data
  • Internet Pretraining: Visual-language knowledge via PaliGemma VLM

Performance

vs OpenVLA, Octo (Zero-shot)

Comparison on complex multi-stage dexterous tasks:

| Task | Pi0 | Pi0-Small | OpenVLA | Octo |
| --- | --- | --- | --- | --- |
| Bussing Easy (UR5e) | 97.1% | 44.3% | 0% | 4.3% |
| Bussing Hard (UR5e) | 87.5% | 33.3% | 0% | 0% |
| Shirt Folding (Bi-ARX) | 100% | 50% | 0% | 0% |
| Grocery Bagging (UR5e) | 78.6% | 27.1% | 0% | 0% |
| Toast from Toaster | 75% | 0% | 0% | 0% |

Effect of VLM Pre-training

| Comparison | Result |
| --- | --- |
| Pi0 (full) vs Pi0-Small | 2x+ performance improvement |
| Reason | Visual-language knowledge from VLM pretraining |

Key Insights

  • OpenVLA/Octo at 0%: Fail on complex dexterous tasks
  • Only Pi0 succeeds: Effectiveness of Flow matching + VLM combination
  • Generalization ability: Consistent performance across various robots

Capabilities

Zero-shot Performance

Tasks the pretrained model can perform without fine-tuning:

  • Manipulation in settings similar to training environment
  • Following language instructions
  • Basic object manipulation

After Fine-tuning

The model can be specialized with a small amount of additional data:

| Task | Required Data |
| --- | --- |
| Laundry folding | A few hours |
| Box assembly | A few hours |
| Complex manipulation | 1-20 hours |

Adaptive Behavior

  • Recovery after human intervention
  • Retry after failure
  • Handling various object shapes

Deployment Modes

1. Zero-shot

Language instruction -> Pi0 -> Robot action
  • Use immediately without additional training
  • Suitable for tasks within training distribution

2. Fine-tuning

Small demonstration data -> Pi0 fine-tuning -> Specialized Pi0
  • 1-20 hours of data sufficient
  • Adapts to new tasks/environments

3. Language-Conditioned

High-level VLM planning -> Pi0 execution
  • External VLM generates high-level plan
  • Pi0 handles low-level execution
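This mode can be sketched as a two-level loop, with both the planner and the policy as hypothetical stubs (`plan_subtasks` and the `policy` callable are illustrative, not a real API):

```python
def plan_subtasks(task):
    """Stand-in for a high-level VLM planner (hard-coded plan)."""
    if task == "bag the groceries":
        return ["pick up the bag", "open the bag", "place items in the bag"]
    return [task]


def execute(policy, task):
    """Feed each planned subtask to the low-level policy as a language command."""
    return [policy(subtask) for subtask in plan_subtasks(task)]


# In practice `policy` would be a Pi0 rollout conditioned on the subtask text.
print(execute(lambda s: f"done: {s}", "bag the groceries"))
```

The division of labor mirrors the table of modes: semantic planning stays in the VLM, while Pi0 contributes only the dexterous execution of each short language command.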

Open Source Release

Released through the openpi repository in February 2025:

Released Models

| Model | Description |
| --- | --- |
| Pi0 base | Pretrained model for fine-tuning |
| Pi0-FAST base | FAST tokenizer applied (5x faster training) |
| Pi0 DROID | Fine-tuned for the Franka single arm |
| Pi0 ALOHA | Fine-tuned for bimanual manipulation |
| Pi0 Libero | Fine-tuned for the Libero simulation benchmark |

Provided Resources

  • JAX original implementation
  • PyTorch implementation (HuggingFace LeRobot)
  • Fine-tuning scripts
  • Inference code

Pi0-FAST

An autoregressive variant that uses the FAST action tokenizer:

| Feature | Details |
| --- | --- |
| Training Speed | 5x faster |
| Language Understanding | Better instruction following |
| Inference Cost | 4-5x higher |
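The intuition behind FAST can be sketched with its core ingredient, a DCT: smooth action trajectories concentrate energy in low frequencies, so an action chunk compresses to a few coefficients that can then be quantized and tokenized. The toy below shows only the DCT truncation step; the quantization and BPE stages of the real tokenizer are omitted.

```python
# Toy DCT compression of a 1-D action trajectory (not the real FAST
# tokenizer). Keeping a few low-frequency coefficients reconstructs a
# smooth 50-step trajectory almost exactly.
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows = frequencies)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m


def compress(traj, keep):
    """Zero out all but the `keep` lowest-frequency DCT coefficients."""
    coeffs = dct_matrix(len(traj)) @ traj
    coeffs[keep:] = 0.0
    return coeffs


def reconstruct(coeffs):
    """Invert the orthonormal DCT (its inverse is its transpose)."""
    return dct_matrix(len(coeffs)).T @ coeffs


u = np.linspace(0.0, 1.0, 50)
traj = 3 * u**2 - 2 * u**3          # smooth 50-step trajectory, amplitude 1
approx = reconstruct(compress(traj, keep=10))
print(float(np.abs(traj - approx).max()))  # small: 10 of 50 coefficients suffice
```

Fewer, more informative tokens per chunk are what make autoregressive training faster, while decoding those tokens one by one is why inference costs more than flow matching's few integration steps.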

Variants

| Variant | Method | Features |
| --- | --- | --- |
| Pi0 | Flow matching | Fast inference, continuous actions |
| Pi0-FAST | Autoregressive | Fast training, better language understanding |
| Pi0-Small | Flow matching | No VLM, lightweight |

Subsequent Versions

Evolved versions after Pi0:

| Version | Released | Key Improvements |
| --- | --- | --- |
| Pi0.5 | 2025.04 | Open-world generalization |
| Pi*0.6 | 2025.11 | RL self-improvement |

Full series overview: Pi Series

