OpenVLA

Stanford/Berkeley's 7B Open-Source Vision-Language-Action Model



Key Significance

  • Democratization of VLA Research: First large-scale open-source VLA (7B parameters), with a complete release of checkpoints, code, and the training pipeline
  • Efficient Performance: Matches or exceeds the 55B-parameter RT-2-X with only 7B parameters, a fraction of the model size
  • Consumer-GPU Fine-tuning: LoRA updates only 1.4% of total parameters, enabling fine-tuning on consumer GPUs
  • Versatility: The only tested model to achieve a 50%+ success rate on every evaluated task
  • Prismatic VLM Base: Strong visual understanding from a fused SigLIP + DINOv2 dual vision encoder
  • Cross-Institution Collaboration: Jointly developed by 5 institutions - Stanford, UC Berkeley, TRI, Google DeepMind, MIT
  • Starting Point for the Open-Source VLA Ecosystem: Foundation for subsequent lightweight open-source VLAs such as SmolVLA and MiniVLA

[Figure] OpenVLA architecture: based on Prismatic VLM (SigLIP + DINOv2) + Llama 2 7B


Overview

OpenVLA is a 7B-parameter open-source VLA model jointly developed by Stanford, UC Berkeley, Toyota Research Institute, Google DeepMind, and MIT. Although much smaller than RT-2-X (55B), it achieves similar or better performance and can be fine-tuned on consumer GPUs.

| Item | Details |
|---|---|
| Published | June 2024 |
| Affiliation | Stanford, UC Berkeley, TRI, Google DeepMind, MIT |
| Paper | arXiv:2406.09246 |
| Project | openvla.github.io |
| GitHub | github.com/openvla/openvla |
| HuggingFace | openvla/openvla-7b |
| Parameters | 7B |
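For orientation, the sketch below shows how the released checkpoint is typically loaded for inference with HuggingFace Transformers. It follows the pattern documented in the openvla/openvla repository; the prompt template, the image path, and the `unnorm_key` value (`bridge_orig` here) are illustrative and should be checked against the repo for your robot setup.

```python
# Minimal inference sketch for openvla/openvla-7b (verify details against the repo README).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current camera frame (path is illustrative)
prompt = "In: What action should the robot take to pick up the remote?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# predict_action decodes the generated action tokens and un-normalizes them
# using the statistics of the dataset selected by unnorm_key.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # continuous end-effector delta + gripper command
```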

Architecture

OpenVLA is built on the Prismatic-7B vision-language model (VLM).

+-------------------------------------------------------------+
|                    OpenVLA Architecture                      |
+-------------------------------------------------------------+
|                                                              |
|  +-------------+  +-------------+                            |
|  |   SigLIP    |  |   DINOv2    |   Visual Encoder          |
|  |  Backbone   |  |  Backbone   |   (Fused)                 |
|  +------+------+  +------+------+                            |
|         |                |                                   |
|         +-------+--------+                                   |
|                 v                                            |
|         +-------------+                                      |
|         |  Projector  |   Image -> LLM space                |
|         +------+------+                                      |
|                v                                             |
|  +-------------------------+                                 |
|  |      Llama 2 7B         |   Language Backbone            |
|  |   (Action Prediction)   |                                 |
|  +-----------+-------------+                                 |
|              v                                               |
|       Tokenized Actions -> Continuous Robot Commands         |
+-------------------------------------------------------------+
| Component | Description |
|---|---|
| Visual Encoder | SigLIP + DINOv2 (features fused) |
| Projector | Maps visual embeddings into the LLM input space |
| LLM Backbone | Llama 2 7B |
| Output | Tokenized actions → continuous robot commands |
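To make the diagram and table concrete, here is a schematic PyTorch sketch of the same flow. It is not the actual Prismatic/OpenVLA implementation: the fused dual encoder, MLP projector, and 256-bin action de-tokenization follow the paper's description, but module names, dimensions, and details are illustrative.

```python
# Schematic of the OpenVLA forward path and action de-tokenization.
# Illustrative only -- not the actual Prismatic/OpenVLA code.
import torch
import torch.nn as nn

class DualEncoderVLA(nn.Module):
    def __init__(self, siglip, dinov2, llm, llm_dim=4096):
        super().__init__()
        self.siglip = siglip      # SigLIP ViT: image -> (B, N, d_siglip) patch features
        self.dinov2 = dinov2      # DINOv2 ViT: image -> (B, N, d_dino) patch features
        fused_dim = siglip.embed_dim + dinov2.embed_dim
        # MLP projector mapping fused patch features into the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(fused_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm            # Llama 2 7B backbone

    def forward(self, image, text_embeds):
        # Channel-wise fusion of the two vision backbones' patch features
        patches = torch.cat([self.siglip(image), self.dinov2(image)], dim=-1)
        visual_tokens = self.projector(patches)  # (B, N, llm_dim)
        # Visual tokens are prepended to the language tokens; the LLM then
        # autoregressively generates discrete action tokens.
        return self.llm(inputs_embeds=torch.cat([visual_tokens, text_embeds], dim=1))

def detokenize_actions(action_bins, low, high, n_bins=256):
    """Map discrete action-bin indices back to continuous robot commands.

    Per the paper, each action dimension is discretized into 256 uniform bins
    between per-dimension bounds (1st/99th quantile of the training data).
    """
    centers = (action_bins.float() + 0.5) / n_bins  # bin centers in [0, 1]
    return low + centers * (high - low)             # un-normalized continuous actions
```

In the released model, the 256 action bins are mapped onto the least-used tokens in the Llama tokenizer's vocabulary, so action prediction reuses the standard next-token objective.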

Training

| Item | Details |
|---|---|
| Dataset | Open X-Embodiment |
| Episodes | 970K |
| Hardware | 64× A100 GPUs |
| Training Duration | 14 days |
| Data Sources | 21 institutions, 22 robot embodiments |
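As a sanity check, the hardware and duration above multiply out to roughly 21.5K A100-hours, in line with the compute figure reported in the paper:

```python
# Back-of-the-envelope training compute from the table above
gpus, days = 64, 14
gpu_hours = gpus * days * 24
print(gpu_hours)  # 21504 -> roughly 21.5K A100-hours
```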

Performance

Zero-Shot Evaluation

| Comparison | Result |
|---|---|
| vs RT-1-X | Superior (WidowX, Google Robot evaluations) |
| vs Octo | Superior (WidowX, Google Robot evaluations) |
| vs RT-2-X (55B) | Comparable or better |

Key takeaway: the 7B OpenVLA matches or outperforms the 55B RT-2-X.

After Fine-tuning

  • Superior to Octo on most Franka-Tabletop and DROID tasks
  • 50%+ success rate on every tested task (the only evaluated model to do so)
  • Outperforms Diffusion Policy on multi-object and language-based tasks

Limitations

  • RT-2-X performs better on difficult semantic generalization (internet concepts)
  • Likely cause: RT-2-X's larger-scale internet pretraining data and its co-fine-tuning strategy

Fine-tuning

One of OpenVLA’s key strengths is efficient fine-tuning.

LoRA (Low-Rank Adaptation)

| Item | Details |
|---|---|
| Fine-tuned Parameters | Only 1.4% of total |
| Performance | Matches full fine-tuning |
| Hardware | Runs on consumer GPUs |
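A hedged sketch of LoRA fine-tuning the released checkpoint with the peft library is shown below. The paper applies LoRA with rank 32 to all linear layers; the remaining hyperparameters (alpha, dropout) are illustrative assumptions, not the exact recipe from the official finetune script.

```python
# Illustrative LoRA fine-tuning setup (alpha/dropout values are assumptions).
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=32,                         # low-rank dimension (rank 32, per the paper)
    lora_alpha=16,                # scaling factor (assumed)
    lora_dropout=0.0,
    target_modules="all-linear",  # adapt all linear layers
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # on the order of ~1-2% of total parameters
```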

Quantization

  • 4-bit quantization preserves task performance while substantially reducing GPU memory requirements
  • Enables efficient serving on smaller GPUs
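For low-memory serving, the checkpoint can be loaded with 4-bit weight quantization via bitsandbytes. The settings below are common defaults, not necessarily the exact configuration used in the paper's quantization experiments:

```python
# Load the checkpoint with 4-bit quantized weights (typical default settings).
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_cfg,
    device_map="auto",
    trust_remote_code=True,
)
# Inference then proceeds exactly as with the full-precision model, at a fraction of the VRAM.
```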

Comparison with Other Models

| Model | Parameters | Open-Source | Features |
|---|---|---|---|
| OpenVLA | 7B | Yes | VLM-based, strong at following language instructions |
| Octo | 93M | Yes | Diffusion-based, fast fine-tuning |
| RT-2-X | 55B | No | Largest model, strong semantic generalization |
| RT-1-X | ~35M | Yes | Lightweight, baseline performance |

Key Advantages

| Feature | Description |
|---|---|
| Performance | 7B model matching a 55B model |
| Efficiency | LoRA trains only 1.4% of parameters |
| Versatility | 50%+ success rate on all tested tasks |
| Accessibility | Fine-tuning and serving on consumer GPUs |
| Open-Source | Complete release of checkpoints and code |

Variants

| Model | Description |
|---|---|
| OpenVLA-7B | Base model |
| MiniVLA | Lightweight version (Stanford ILIAD) |

References

  • Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv:2406.09246, 2024
  • Project page: openvla.github.io
  • Code: github.com/openvla/openvla