Octo

UC Berkeley's Open-Source Generalist Robot Foundation Model


Key Significance

  • Ultra-Lightweight, High Performance: 93M parameters achieve performance comparable to the 55B-parameter RT-2-X
  • Diffusion-Based Action Generation: a Transformer encoder paired with a diffusion decoder to model multimodal action distributions
  • Practical Fine-tuning: adapts to new robots/tasks with ~100 demonstrations in a few hours on a consumer GPU
  • Flexible I/O: tasks specified via language instructions or goal images; supports varied observation/action spaces
  • Fully Open-Source: complete release of checkpoints, training code, and fine-tuning scripts
  • Open X-Embodiment Pretraining: trained on 800K episodes drawn from 25 datasets
  • Standard for Fast Adaptation: widely used as a baseline when a policy must be adapted quickly to a new robot platform

Figure: Octo architecture (Transformer encoder + diffusion decoder)


Overview

Octo is an open-source generalist robot policy jointly developed by UC Berkeley, Stanford, and CMU. Pretrained on 800K episodes from the Open X-Embodiment dataset, it can be quickly fine-tuned on various robot platforms.

| Item        | Details                      |
| ----------- | ---------------------------- |
| Published   | May 2024 (RSS 2024)          |
| Affiliation | UC Berkeley, Stanford, CMU   |
| Paper       | arXiv:2405.12213             |
| Project     | octo-models.github.io        |
| GitHub      | github.com/octo-models/octo  |
| License     | Open source                  |

Model Variants

| Model      | Parameters | Use Case                     |
| ---------- | ---------- | ---------------------------- |
| Octo-Small | 27M        | Lightweight, fast experiments |
| Octo-Base  | 93M        | Higher performance            |

Architecture

Octo is a Transformer-based diffusion policy: a Transformer encoder fuses the inputs into embeddings, and a diffusion head decodes them into continuous action sequences.

+-------------------------------------------------------------+
|                      Octo Architecture                      |
+-------------------------------------------------------------+
|  Inputs:                                                    |
|  +----------+  +----------+  +----------+                   |
|  | Images   |  | Language |  | Goal     |                   |
|  | (multi)  |  | Instruct |  | Image    |                   |
|  +----+-----+  +----+-----+  +----+-----+                   |
|       |             |             |                         |
|       +-------------+-------------+                         |
|                     |                                       |
|              +------v------+                                |
|              | Transformer |                                |
|              |   Encoder   |                                |
|              +------+------+                                |
|                     |                                       |
|              +------v------+                                |
|              |  Diffusion  |                                |
|              |   Decoder   |                                |
|              +------+------+                                |
|                     |                                       |
|              +------v------+                                |
|              |   Action    |                                |
|              |  Sequence   |                                |
|              +-------------+                                |
+-------------------------------------------------------------+

Supported Features:

  • Task specification via natural-language instruction or goal image
  • Observation history (context over multiple past frames)
  • Multimodal action distributions via diffusion decoding
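To make the diffusion-decoding step concrete, here is a minimal, self-contained sketch of DDPM-style ancestral sampling for an action chunk. Everything in it (the 7-D action, 4-step chunk, toy noise schedule, and the stand-in denoiser) is an illustrative assumption, not Octo's actual network or hyperparameters:

```python
import numpy as np

ACTION_DIM = 7   # e.g. 6-DoF end-effector delta + gripper (assumed)
CHUNK_LEN = 4    # length of the predicted action sequence (assumed)
N_STEPS = 20     # number of denoising steps (assumed)

def denoiser(noisy_actions, embedding, t):
    """Stand-in for the learned noise-prediction network.

    The real model predicts the noise added to the clean action chunk,
    conditioned on the transformer embedding and diffusion step t. Here
    we return a simple deterministic function of the inputs so the
    sampling loop below is runnable.
    """
    return 0.1 * noisy_actions + 0.01 * embedding.mean() * np.ones_like(noisy_actions)

def sample_action_chunk(embedding, rng):
    """DDPM-style sampling: start from Gaussian noise, iteratively denoise."""
    alphas = np.linspace(0.99, 0.95, N_STEPS)   # toy noise schedule
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))  # pure noise
    for t in reversed(range(N_STEPS)):
        eps = denoiser(x, embedding, t)
        # Posterior-mean update (standard DDPM form).
        x = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add noise at every step except the last
            x += np.sqrt(1 - alphas[t]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
embedding = rng.standard_normal(384)  # stand-in for the transformer's readout token
chunk = sample_action_chunk(embedding, rng)
print(chunk.shape)  # (4, 7): a short sequence of 7-D actions
```

Because the head samples rather than regresses, it can represent multimodal action distributions (e.g. "go left around the obstacle" vs. "go right") instead of averaging them into an invalid middle action.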

Training Data

| Item               | Details                             |
| ------------------ | ----------------------------------- |
| Dataset            | Open X-Embodiment                   |
| Episodes           | 800K                                |
| Number of Datasets | 25                                  |
| Robot Types        | Various (single-arm, bimanual, etc.) |
| Sensors            | Cameras, proprioception, etc.       |

Performance

Zero-Shot (environments seen during pretraining)

| Robot      | Success Rate |
| ---------- | ------------ |
| WidowX     | 50%          |
| UR5        | 70%          |
| RT-1 Robot | 80%          |

Comparison:

  • Outperforms RT-1-X
  • Comparable to RT-2-X despite RT-2-X having 55B parameters to Octo's 93M

After Fine-tuning (Average across 6 tasks)

| Model | Success Rate |
| ----- | ------------ |
| Octo  | 72%          |
| VC-1  | 15%          |

On average, a 52% improvement over the next-best baseline


Fine-tuning Capabilities

Octo’s key strength is fast adaptation.

| Adaptable Element  | Examples                          |
| ------------------ | --------------------------------- |
| New observations   | Force-torque sensing, proprioception |
| New action spaces  | Joint-position control            |
| New robots         | Bimanual systems, etc.            |

Requirements:

  • ~100 demonstrations on the target setup
  • A few hours of training on a consumer GPU
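As a toy illustration of the adaptation idea (reuse pretrained features, fit a fresh output head for a new action space), the sketch below fits a linear action head on 100 synthetic "demonstrations". The frozen encoder, dimensions, and data are all stand-ins; real Octo fine-tuning updates the transformer itself using the released scripts:

```python
import numpy as np

EMB_DIM, NEW_ACTION_DIM, N_DEMOS = 384, 7, 100  # illustrative sizes

def frozen_encoder(obs):
    """Stand-in for the pretrained transformer: a fixed random projection."""
    W = np.random.default_rng(42).standard_normal((obs.shape[-1], EMB_DIM))
    return obs @ (W / np.sqrt(obs.shape[-1]))

# ~100 demonstrations in the new robot's observation/action space (synthetic).
rng = np.random.default_rng(0)
obs = rng.standard_normal((N_DEMOS, 64))              # new observation space
actions = rng.standard_normal((N_DEMOS, NEW_ACTION_DIM))  # e.g. joint targets

# Fit a new linear action head on the frozen features via least squares.
feats = frozen_encoder(obs)
head, *_ = np.linalg.lstsq(feats, actions, rcond=None)

pred = feats @ head
print(head.shape)  # (384, 7): new head mapping embeddings -> joint targets
```

The point of the sketch is the division of labor: the expensive generalist representation is reused as-is, while only a small task-specific mapping needs to be learned from ~100 demonstrations.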

Key Advantages

| Feature      | Description                                                    |
| ------------ | -------------------------------------------------------------- |
| Open-Source  | Complete release of checkpoints, training code, fine-tuning scripts |
| Flexibility  | Supports varied observation/action spaces                      |
| Efficiency   | 93M parameters with performance approaching 55B-scale models   |
| Practicality | Fine-tuning feasible on a consumer GPU                         |

Comparison with RT-X

| Item        | Octo | RT-1-X | RT-2-X    |
| ----------- | ---- | ------ | --------- |
| Parameters  | 93M  | ~35M   | 55B       |
| Open-Source | Yes  | Yes    | No        |
| Performance | High | Medium | High      |
| Fine-tuning | Easy | Medium | Difficult |
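The parameter gap in the table is worth making concrete; a quick back-of-the-envelope calculation (plain arithmetic, nothing model-specific):

```python
octo_base = 93e6   # Octo-Base parameter count
rt2_x = 55e9       # RT-2-X parameter count
print(f"RT-2-X is ~{rt2_x / octo_base:.0f}x larger than Octo-Base")
# RT-2-X is ~591x larger than Octo-Base
```

That roughly 600x size difference is what makes consumer-GPU fine-tuning feasible for Octo but not for RT-2-scale models.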

Released Resources

  • Pretrained checkpoints (Octo-Small 27M, Octo-Base 93M)
  • Fine-tuning scripts
  • Complete pretraining pipeline
  • Evaluation code
