Octo
Key Significance
- Ultra-Lightweight, High Performance: at 93M parameters, achieves performance comparable to the 55B-parameter RT-2-X, giving it exceptional performance per parameter
- Diffusion-Based Action Generation: combines a Transformer backbone with a diffusion head to model multimodal action distributions
- Practical Fine-tuning: adapts to new robots and tasks with ~100 demonstrations, in a few hours on a consumer GPU
- Flexible I/O: tasks specified via language instructions or goal images; supports varied observation and action spaces
- Fully Open-Source: complete release of checkpoints, training code, and fine-tuning scripts
- Open X-Embodiment Utilization: pretrained on 800K episodes drawn from 25 datasets
- Standard for Fast Adaptation: a common baseline when a policy must be adapted quickly to a new robot platform

Octo Architecture: Transformer Encoder + Diffusion Decoder Structure
Overview
Octo is an open-source generalist robot policy jointly developed by UC Berkeley, Stanford, and CMU. Pretrained on 800K episodes from the Open X-Embodiment dataset, it can be quickly fine-tuned on various robot platforms.
| Item | Details |
|---|---|
| Published | May 2024 (RSS 2024) |
| Affiliation | UC Berkeley, Stanford, CMU |
| Paper | arXiv:2405.12213 |
| Project | octo-models.github.io |
| GitHub | github.com/octo-models/octo |
| License | Open Source |
Model Variants
| Model | Parameters | Use Case |
|---|---|---|
| Octo-Small | 27M | Lightweight, fast experiments |
| Octo-Base | 93M | Higher performance |
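Both variants are released as pretrained checkpoints. As a minimal sketch of loading one (the HuggingFace Hub repo IDs and method names follow the usage pattern in the authors' README; verify exact names against the version you install):

```python
# Sketch: loading a pretrained Octo checkpoint from the HuggingFace Hub.
# Repo IDs assumed from the authors' releases; check the README for
# current names ("octo-small" = 27M, "octo-base" = 93M).
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
print(model.get_pretty_spec())  # prints the expected observation/action interface
```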
Architecture
Octo is a Transformer-based Diffusion Policy.
+---------------------------------------------+
|              Octo Architecture              |
+---------------------------------------------+
| Inputs:                                     |
|  +----------+  +----------+  +----------+   |
|  |  Images  |  | Language |  |   Goal   |   |
|  | (multi)  |  | Instruct |  |  Image   |   |
|  +----+-----+  +----+-----+  +----+-----+   |
|       |             |             |         |
|       +-------------+-------------+         |
|                     |                       |
|              +------v------+                |
|              | Transformer |                |
|              |   Encoder   |                |
|              +------+------+                |
|                     |                       |
|              +------v------+                |
|              |  Diffusion  |                |
|              |   Decoder   |                |
|              +------+------+                |
|                     |                       |
|              +------v------+                |
|              |   Action    |                |
|              |  Sequence   |                |
|              +-------------+                |
+---------------------------------------------+
Supported Features:
- Task specification via natural language instruction or goal image
- Observation history
- Multimodal action distributions via diffusion decoding
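The inference interface mirrors this structure: a task is built from either a language instruction or a goal image, and the diffusion decoder samples an action chunk conditioned on the encoded observation history. The sketch below follows the usage pattern in the Octo repo; argument names (e.g. `timestep_pad_mask`) have changed slightly between released versions, so treat it as illustrative rather than exact.

```python
import jax
import numpy as np
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")

# Task specification: natural language OR a goal image (one is enough).
task = model.create_tasks(texts=["pick up the green block"])
# task = model.create_tasks(goals={"image_primary": goal_image[None]})

# Observation history: a window of 2 camera frames, batch size 1.
observation = {
    "image_primary": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8),
    "timestep_pad_mask": np.ones((1, 2), dtype=bool),
}

# The diffusion head iteratively denoises random noise into an action
# chunk, so sampling is stochastic and takes an explicit PRNG key.
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(actions.shape)  # (batch, prediction_horizon, action_dim)
```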
Training Data
| Item | Details |
|---|---|
| Dataset | Open X-Embodiment |
| Episodes | 800K |
| Number of Datasets | 25 |
| Robot Types | Various (single arm, bimanual, etc.) |
| Sensors | Camera, proprioception, etc. |
Performance
Zero-Shot (environments seen during pretraining)
| Robot | Success Rate |
|---|---|
| WidowX | 50% |
| UR5 | 70% |
| RT-1 Robot | 80% |
Comparison:
- Outperforms RT-1-X
- Comparable to RT-2-X (55B parameters), while Octo has only 93M
After Fine-tuning (Average across 6 tasks)
| Model | Success Rate |
|---|---|
| Octo | 72% |
| VC-1 | 15% |
→ 52% average improvement over the next-best baseline
Fine-tuning Capabilities
Octo’s key strength is fast adaptation.
| Adaptable Element | Examples |
|---|---|
| New Observations | Force-torque, proprioception |
| New Action Spaces | Joint position control |
| New Robots | Bimanual systems, etc. |
Requirements:
- ~100 demonstrations on the target setup
- A few hours of training on a consumer-grade GPU
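In code, the recipe amounts to loading the pretrained weights and running a short supervised training loop on the target demonstrations (the repo ships a maintained version in scripts/finetune.py). Below is a hedged sketch using optax; `demo_batches` and `octo_loss` are hypothetical stand-ins for the data iterator and the diffusion denoising loss provided by the Octo training code.

```python
import jax
import optax  # JAX optimizer library used by the Octo codebase
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")
params = model.params

tx = optax.adamw(learning_rate=3e-4)  # illustrative hyperparameter
opt_state = tx.init(params)

@jax.jit
def update(params, opt_state, batch):
    # octo_loss (hypothetical) runs the transformer on the batch and
    # returns the diffusion head's denoising loss on the demo actions.
    loss, grads = jax.value_and_grad(octo_loss)(params, batch)
    updates, opt_state = tx.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss

for batch in demo_batches:  # ~100 demos, iterated for a few thousand steps
    params, opt_state, loss = update(params, opt_state, batch)
```

New sensors or action spaces are handled by swapping the corresponding input tokenizer or action head in the model configuration before loading the pretrained transformer weights, which is what makes adaptation to new observation and action spaces possible without retraining from scratch.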
Key Advantages
| Feature | Description |
|---|---|
| Open-Source | Complete release of checkpoints, training code, fine-tuning scripts |
| Flexibility | Supports various observation/action spaces |
| Efficiency | 93M parameters matching the performance of a 55B model |
| Practicality | Fine-tuning possible on consumer GPU |
Comparison with RT-X
| Item | Octo | RT-1-X | RT-2-X |
|---|---|---|---|
| Parameters | 93M | ~35M | 55B |
| Open-Source | Yes | Yes | No |
| Performance | High | Medium | High |
| Fine-tuning | Easy | Medium | Difficult |
Released Resources
- Pretrained checkpoints (27M, 93M)
- Fine-tuning scripts
- Complete pretraining pipeline
- Evaluation code