VLA Model List

History and list of Vision-Language-Action models

VLA (Vision-Language-Action) models are AI models that take visual observations and language instructions as input and output robot actions.
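Despite their differences, every model below shares that same input/output contract, which is worth pinning down before the list. The sketch below is illustrative only; the class name and signature are not any particular model's API:

```python
import numpy as np

class VLAPolicy:
    """Hypothetical interface shared by the models listed below."""

    def predict_action(self, image: np.ndarray, instruction: str,
                       proprio: np.ndarray) -> np.ndarray:
        """Map (camera image, language command, joint state) -> action.

        image:       HxWx3 RGB observation
        instruction: e.g. "pick up the red cup"
        proprio:     current joint positions / gripper state
        returns:     action vector, e.g. 7-DoF end-effector delta + gripper
        """
        raise NotImplementedError
```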


Timeline

VLA models have evolved rapidly since RT-1 in 2022.

  • 2022: RT-1 (Google)
  • 2023: RT-2, ACT, Diffusion Policy
  • 2024: Octo, OpenVLA, GR00T, π0
  • 2025: SmolVLA, Gemini Robotics, π0.5, GR00T N1.5/N1.6, π*0.6

VLA Foundation Models

Google DeepMind

| Model | Description |
| --- | --- |
| RT (Robotics Transformer) | Pioneer of VLA. Started with RT-1, established the "Action as Language" paradigm with RT-2, and built the Open X-Embodiment dataset with RT-X |
| Gemini Robotics | Gemini 2.0-based VLA. Cross-embodiment support, System 1/2 architecture, on-device version available |

Physical Intelligence (π Series)

| Model | Description |
| --- | --- |
| π Series | Overview of Physical Intelligence's VLA model series |
| π0 | First generalist policy built on flow matching. PaliGemma VLM + 50 Hz high-speed control |
| π0.5 | Open-world generalization. Works in unseen home environments; co-trained on web data |
| π*0.6 | RL-based self-improvement. RECAP methodology achieving 90%+ success rates |
| FAST | DCT + BPE action tokenizer. 10x compression and 5x faster VLA training (a tokenizer sketch follows this table) |
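FAST is the one entry above whose description is itself an algorithm: compress an action chunk with a DCT, then compress the resulting symbol stream with BPE. Below is a minimal sketch of the DCT-and-quantize half, assuming a per-dimension DCT over the chunk and a single illustrative quantization scale; the learned BPE vocabulary (and FAST's actual hyperparameters) are omitted:

```python
import numpy as np
from scipy.fft import dct

def fast_style_symbols(chunk: np.ndarray, scale: float = 10.0) -> list[int]:
    """Map a (horizon, action_dim) chunk of continuous actions to integers.

    A DCT along the time axis packs most of the signal into a few
    low-frequency coefficients; after rounding, the long runs of zeros
    in the high frequencies are what a BPE stage (omitted here) would
    compress so effectively.
    """
    coeffs = dct(chunk, axis=0, norm="ortho")         # per-dimension time DCT
    quantized = np.round(coeffs * scale).astype(int)  # coarse scalar quantization
    return quantized.flatten(order="C").tolist()      # low frequencies first

# Example: a 50-step, 7-DoF action chunk sampled at 50 Hz
symbols = fast_style_symbols(np.random.randn(50, 7) * 0.05)
```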

NVIDIA (GR00T Series)

| Model | Description |
| --- | --- |
| GR00T | Overview of NVIDIA's humanoid foundation model series. Dual-system architecture (a control-loop sketch follows this table) |
| GR00T N1 | World's first open-source humanoid VLA. Demonstrated a 40% performance gain from synthetic data |
| GR00T N1.5 | Frozen VLM + FLARE loss. 2x improvement in language-instruction following (46.6% → 93.3%) |
| GR00T N1.6 | 2x DiT scale-up, Cosmos VLM, relative action space. Loco-manipulation support |
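GR00T's dual-system split (and Gemini Robotics' System 1/2) follows the same pattern: a slow vision-language module produces a plan, and a fast action module tracks it between plan updates. The control loop below is a minimal sketch of that pattern only; the rates, latent size, and function names are illustrative, not NVIDIA's API:

```python
import numpy as np

SYSTEM2_HZ = 10    # slow VLM reasoning rate (illustrative)
SYSTEM1_HZ = 120   # fast action decoding rate (illustrative)

def system2_plan(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stub for System 2: fuse scene + instruction into a latent plan."""
    return np.zeros(512)

def system1_act(plan: np.ndarray, proprio: np.ndarray) -> np.ndarray:
    """Stub for System 1: decode the next motor command from the plan."""
    return np.zeros(7)

def control_loop(image, instruction, proprio, steps=1200):
    plan = system2_plan(image, instruction)
    for t in range(steps):
        # refresh the slow plan once every SYSTEM1_HZ // SYSTEM2_HZ fast ticks
        if t % (SYSTEM1_HZ // SYSTEM2_HZ) == 0:
            plan = system2_plan(image, instruction)
        action = system1_act(plan, proprio)
        # send `action` to the robot; fresh observations would update
        # `image` and `proprio` here
```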

Open-Source VLA

| Model | Description |
| --- | --- |
| OpenVLA | First large-scale open-source VLA (7B). Performance on par with the 55B RT-2-X; efficient fine-tuning with LoRA (a fine-tuning sketch follows this table) |
| Octo | 93M-parameter lightweight model. Transformer + diffusion combination; fine-tunable on a consumer GPU |
| SmolVLA | π0-level performance at 450M parameters. Runs on a MacBook; trained on LeRobot community data |
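The LoRA fine-tuning noted for OpenVLA can be outlined with Hugging Face's peft library. OpenVLA is distributed as a Hugging Face checkpoint loaded with custom code; the adapter rank, alpha, and target-module choice below are illustrative, not OpenVLA's published recipe:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

# Load the released 7B checkpoint (custom model code ships with it)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)

# Wrap the linear projections with low-rank adapters; rank/alpha here
# are illustrative choices, not the paper's exact hyperparameters.
lora_cfg = LoraConfig(r=32, lora_alpha=16, lora_dropout=0.0,
                      target_modules="all-linear")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # a small fraction of the 7B weights
```

Only the low-rank adapter weights train, which is what makes fine-tuning a 7B policy tractable on a single GPU.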

Corporate VLA

| Model | Description |
| --- | --- |
| Figure Helix | Figure AI's humanoid VLA. First full-body high-speed control (200 Hz); simultaneous control of two robots |
| LBM (Large Behavior Model) | Boston Dynamics + TRI's VLA for Atlas. 450M diffusion transformer; whole-body control from a single model |
| CraftNet | Sharpa's VTLA model. Tactile integration, System 0/1/2 hierarchy, 100 Hz fine manipulation |
| Redwood AI | 1X Technologies' VLA for NEO. 160M parameters, on-board execution, world-model integration |
| Generalist GEN-0 | Claims discovery of robotics scaling laws from 270,000 hours of real data. Harmonic Reasoning architecture |
| Sunday ACT-1 | "Zero robot data" approach. 10M+ episodes collected from 500+ homes using $200 gloves |

Imitation Learning Policy Models

| Model | Description |
| --- | --- |
| ACT | Stanford's action-chunking policy. 80-90% success rates from 10 minutes of demonstrations; released with the ALOHA hardware (a chunking sketch follows this table) |
| Diffusion Policy | Diffusion-based visuomotor policy. Handles multimodal action distributions naturally; 46.9% average performance improvement |
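ACT's core idea, action chunking, is compact enough to sketch: the policy emits the next K actions at once, and at execution time the overlapping predictions for the current step are averaged with exponential weights (the paper's temporal ensembling, where older predictions receive larger weights). The policy below is a stub, and K and the temperature m are ballpark values, not a tuned configuration:

```python
import numpy as np

K = 100   # chunk length
M = 0.01  # ensembling temperature

def policy(obs) -> np.ndarray:
    """Stub: a trained ACT model would return a (K, action_dim) chunk."""
    return np.zeros((K, 7))

def run(env_steps: int, obs):
    # buffer[t] collects every chunk's prediction for timestep t
    buffer: dict[int, list[np.ndarray]] = {}
    for t in range(env_steps):
        chunk = policy(obs)
        for i in range(K):
            buffer.setdefault(t + i, []).append(chunk[i])
        preds = buffer.pop(t)                   # all predictions for "now"
        w = np.exp(-M * np.arange(len(preds)))  # oldest prediction weighs most
        action = np.average(preds, axis=0, weights=w)
        # step the environment with `action`; obs update omitted
```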

Vision-Language Models (for Robotics)

| Model | Description |
| --- | --- |
| Eagle | NVIDIA's mixture-of-encoders VLM. Serves as the visual backbone for GR00T N1/N1.5 |
| Cosmos | NVIDIA's world-foundation-model platform. Provides Tokenizer, Predict, Transfer, and Reason models |

Synthetic Data Generation

| Model | Description |
| --- | --- |
| DreamGen | NVIDIA's neural-trajectory generation pipeline. Generates GR00T training data in 36 hours using a world foundation model |

Model Comparison

Parameters and Features

| Model | Parameters | Open Source | Features |
| --- | --- | --- | --- |
| π0 | 3.3B | Yes | Flow matching, 50 Hz |
| GR00T N1 | 2.2B | Yes | Dual-system, humanoid |
| OpenVLA | 7B | Yes | Prismatic VLM, LoRA |
| SmolVLA | 450M | Yes | Runs on a MacBook |
| Octo | 93M | Yes | Diffusion decoder |
| Gemini Robotics | - | No | Gemini 2.0 based |
| Figure Helix | - | No | 200 Hz high-speed control |
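The "Flow matching" entry for π0 refers to training the action head to regress a velocity field between Gaussian noise and expert action chunks. Below is the standard linear-path conditional flow-matching loss in skeleton form; the network, shapes, and conditioning are stand-ins, not π0's actual architecture:

```python
import torch

def flow_matching_loss(net, actions: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Standard linear-path conditional flow matching.

    actions: (B, H, D) expert action chunks
    cond:    (B, C) conditioning features from the VLM
    net:     predicts a velocity given (noisy actions, t, cond)
    """
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # time in [0, 1]
    x_t = (1 - t) * noise + t * actions   # linear interpolation path
    target_velocity = actions - noise     # d x_t / d t along the path
    pred = net(x_t, t.squeeze(), cond)
    return torch.nn.functional.mse_loss(pred, target_velocity)
```

At inference time the learned velocity field is integrated from pure noise toward an action chunk in a handful of Euler steps, which is how π0-style models keep generation fast enough for real-time control.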

Training Data Scale

| Model | Data Scale | Data Type |
| --- | --- | --- |
| Generalist GEN-0 | 270,000 hours | Real robot |
| π0 | 10,000+ hours | Teleoperation |
| Sunday ACT-1 | 10M+ episodes | Gloves (human motion) |
| GR00T N1 | 780K synthetic + real | Simulation + teleoperation |
| SmolVLA | 10.6M frames | Community data |