# VLA Models

History and current state of Vision-Language-Action models
VLA (Vision-Language-Action) models are AI models that take visual observations and language instructions as input and output robot actions.
## Timeline
VLA models have evolved rapidly since RT-1 in 2022.
- 2022: RT-1 (Google)
- 2023: RT-2, ACT, Diffusion Policy
- 2024: Octo, OpenVLA, GR00T, π0
- 2025: SmolVLA, Gemini Robotics, π0.5, GR00T N1.5/N1.6, π*0.6
## VLA Foundation Models

### Google DeepMind
| Model | Description |
|---|---|
| RT (Robotics Transformer) | Pioneer of VLA. Started with RT-1, established “Action as Language” paradigm in RT-2, built Open X-Embodiment dataset with RT-X |
| Gemini Robotics | Gemini 2.0-based VLA. Cross-Embodiment support, System 1/2 architecture, On-Device version available |
### Physical Intelligence (π Series)
| Model | Description |
|---|---|
| π Series | Physical Intelligence VLA model series overview |
| π0 | First Generalist Policy with Flow Matching. PaliGemma VLM + 50Hz high-speed control |
| π0.5 | Open-World generalization. Works in new home environments, Web data Co-training |
| π*0.6 | RL-based self-improvement. RECAP methodology achieving 90%+ success rate |
| FAST | DCT + BPE action tokenizer. 10x compression, 5x faster VLA training |
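The FAST recipe noted above (a discrete cosine transform over an action chunk, followed by quantization and BPE) can be illustrated with a minimal sketch. This is not the published implementation: the chunk length, `keep` count, and `scale` are illustrative assumptions, the BPE stage is omitted, and truncating high-frequency coefficients stands in for the real compression pipeline.

```python
import math

def dct2(xs):
    """Type-II DCT of a 1-D sequence (pure Python, unscaled)."""
    n = len(xs)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(xs))
            for k in range(n)]

def compress_chunk(chunk, keep=4, scale=10.0):
    """Transform an action chunk to DCT coefficients, drop the
    high-frequency tail, and round to integer tokens.
    (The BPE step that FAST applies afterwards is omitted.)"""
    coeffs = dct2(chunk)
    return [round(c * scale) for c in coeffs[:keep]]

# hypothetical 16-step single-joint trajectory (smooth ramp)
chunk = [0.1 * t for t in range(16)]
tokens = compress_chunk(chunk)
print(len(chunk), "->", len(tokens), "tokens")  # 16 -> 4 tokens
```

Because smooth trajectories concentrate energy in the low-frequency coefficients, most of the signal survives the 4× truncation shown here; stacking BPE on top of the quantized coefficients is what yields the larger compression ratios reported for FAST.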
### NVIDIA (GR00T Series)
| Model | Description |
|---|---|
| GR00T | NVIDIA humanoid foundation model series overview. Dual-System architecture |
| GR00T N1 | World’s first open-source humanoid VLA. Proved 40% performance gain with synthetic data |
| GR00T N1.5 | Frozen VLM + FLARE Loss. 2x improvement in language instruction following (46.6% → 93.3%) |
| GR00T N1.6 | 2x DiT scale-up, Cosmos VLM, Relative Action Space. Loco-manipulation support |
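“Relative Action Space” means the policy outputs deltas with respect to the robot’s current state rather than absolute joint targets. A minimal 1-DoF illustration (the function names are hypothetical, not GR00T APIs):

```python
def to_relative(targets, current):
    """Re-express absolute joint targets as step-to-step deltas
    starting from the current state (relative action space)."""
    deltas, state = [], current
    for t in targets:
        deltas.append(round(t - state, 6))
        state = t
    return deltas

def to_absolute(deltas, current):
    """Inverse: integrate deltas back into absolute targets."""
    targets, state = [], current
    for d in deltas:
        state = round(state + d, 6)
        targets.append(state)
    return targets

print(to_relative([1.0, 1.5, 1.5], current=0.5))  # [0.5, 0.5, 0.0]
```

Relative actions make the same trajectory reusable from different starting poses, which helps when transferring one policy across embodiments or initial states.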
## Open-Source VLA
| Model | Description |
|---|---|
| OpenVLA | First large-scale open-source VLA (7B). Performance on par with 55B RT-2-X, efficient fine-tuning with LoRA |
| Octo | 93M lightweight model. Transformer + Diffusion combination, fine-tuning possible on consumer GPU |
| SmolVLA | π0-level performance with 450M. Runs on MacBook, trained on LeRobot community data |
## Corporate VLA
| Model | Description |
|---|---|
| Figure Helix | Figure AI’s humanoid VLA. First full-body high-speed control (200Hz), dual robot simultaneous control |
| LBM (Large Behavior Model) | Boston Dynamics + TRI’s VLA for Atlas. 450M Diffusion Transformer, whole-body single model control |
| CraftNet | Sharpa’s VTLA model. Tactile integration, System 0/1/2 hierarchy, 100Hz fine manipulation |
| Redwood AI | 1X Technologies’ VLA for NEO. 160M parameters, on-board execution, World Model integration |
| Generalist GEN-0 | Claims discovery of robotics scaling laws with 270,000 hours of real data. Harmonic Reasoning architecture |
| Sunday ACT-1 | Zero Robot Data approach. 10M+ episodes collected from 500+ homes using $200 gloves |
## Imitation Learning Policy Models
| Model | Description |
|---|---|
| ACT | Stanford’s Action Chunking policy. 80-90% success rate with 10 minutes of demonstration, released with ALOHA hardware |
| Diffusion Policy | Diffusion-based visuomotor policy. Natural handling of multimodal actions, 46.9% performance improvement |
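ACT’s action chunking pairs with temporal ensembling at inference time: the policy predicts a chunk of k future actions at every step, and the overlapping predictions for the current timestep are blended with exponential weights that favor older predictions. A 1-DoF sketch, not the ALOHA implementation; the decay constant `m` is an illustrative assumption:

```python
import math

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every chunk prediction that covers timestep t.
    chunks[s] is the action chunk predicted at timestep s;
    weight exp(-m * s) favors older predictions (ACT-style)."""
    preds, weights = [], []
    for start, chunk in enumerate(chunks):
        offset = t - start
        if 0 <= offset < len(chunk):
            preds.append(chunk[offset])
            weights.append(math.exp(-m * start))
    total = sum(weights)
    return sum(p * w for p, w in zip(preds, weights)) / total

# hypothetical: a new 3-step chunk is predicted at each timestep
chunks = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
action = temporal_ensemble(chunks, t=2)  # blends chunks 0, 1, 2
```

The blending smooths out jumps between consecutive chunks, which is one reason chunked policies can run stably on real hardware.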
## Vision-Language Models (for Robotics)
| Model | Description |
|---|---|
| Eagle | NVIDIA’s Mixture of Encoders VLM. Serves as visual brain for GR00T N1/N1.5 |
| Cosmos | NVIDIA’s World Foundation Model platform. Provides Tokenizer, Predict, Transfer, Reason models |
## Synthetic Data Generation
| Model | Description |
|---|---|
| DreamGen | NVIDIA’s Neural Trajectory generation pipeline. Generates GR00T training data in 36 hours using World Foundation Model |
## Model Comparison

### Parameters and Features

### Training Data Scale
| Model | Data Scale | Data Type |
|---|---|---|
| Generalist GEN-0 | 270,000 hours | Real robot |
| π0 | 10,000+ hours | Teleoperation |
| Sunday ACT-1 | 10M+ episodes | Gloves (human motion) |
| GR00T N1 | 780K synthetic trajectories + real data | Simulation + Teleoperation |
| SmolVLA | 10.6M frames | Community data |