VLA & RFM Progress

The ongoing development of Vision-Language-Action models and Robot Foundation Models

Author’s Note

VLA development accelerated rapidly from 2024 onward, and in 2025 a remarkable convergent evolution occurred: different research groups independently arrived at similar conclusions.

The most impressive aspect is the convergence on System 1/2 architecture and continuous action generation methods. I believe this is not mere coincidence, but an inevitable consequence stemming from the very nature of the physical intelligence problem.

An overview of VLA and RFM development, presented in reverse chronological order: from the latest models back to the foundational research that made them possible.


2026: The Year of Expansion

Figure Helix 02 - Full Body Loco-Manipulation

Figure Helix 02 is the first VLA to achieve high-speed control of a full humanoid body.

| Feature | Significance |
| --- | --- |
| Full Body Control | Controls the entire upper body at 200Hz: wrists, torso, head, and individual fingers |
| Loco-Manipulation | Manipulates 21 objects while moving 130+ feet (Table-to-Dishwasher task) |
| Dual Robot | Two Figure 02 robots simultaneously solve shared long-horizon manipulation tasks |

While previous VLAs primarily focused on tabletop manipulation, Helix 02 is the first example of integrating locomotion with manipulation. Seeing a humanoid actually move through space while performing complex tasks marks a new chapter in robotics.

Sharpa - The Beginning of Tactile-Based VLA

In early 2026, Sharpa announced CraftNet, a model that integrates tactile sensing into VLA, opening new research directions.

While existing VLAs relied primarily on visual information, adding tactile sensing is expected to enable more delicate manipulation.

For more on the necessity and challenges of tactile sensing, see The Need for Tactile Sensing.


2025: The Year of Convergent Evolution

2025 was a year when Convergent Evolution became prominent in VLA research. Different research groups independently arrived at similar architectures.

Convergence 1: System 1 / System 2 Architecture

A dual-system structure inspired by Daniel Kahneman’s “Thinking, Fast and Slow” was adopted by multiple models.

| System | Role | Frequency | Characteristics |
| --- | --- | --- | --- |
| System 2 | High-level planning, language/visual understanding | 7-10 Hz | Slow thinking, VLM-based |
| System 1 | Low-level motor control | 100-200 Hz | Fast thinking, real-time response |

Major models adopting this structure:

| Model | Release | System 2 | System 1 | Frequency |
| --- | --- | --- | --- | --- |
| GR00T N1.6 | 2025.09 | Cosmos-Reason-2B VLM | DiT, 32 layers | 120Hz |
| Figure Helix | 2025.02 | High-level planning (7-9Hz) | Low-level control | 200Hz |
| Gemini Robotics | 2025.03 | Cloud inference | On-Device control | - |

Figure: GR00T N1.6 architecture (NVIDIA)

Figure: Figure Helix architecture (Figure AI)

Figure: Gemini Robotics architecture (Google DeepMind)

Why is this structure necessary?

As discussed in Physical vs Cognitive Intelligence, physical actions require fast feedback at the millisecond level. Language understanding and planning, however, are relatively slow. Processing both with a single model inevitably requires a hierarchical structure.
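The hierarchy can be illustrated as a nested control loop. This is a hypothetical sketch, not any model's actual implementation: the function names, the 7-DoF state, and the exact frequencies are assumptions.

```python
# Hypothetical sketch of a dual-system control loop. System 2 (slow,
# VLM-based) refreshes a latent plan at 10 Hz; System 1 (fast, reactive)
# emits a motor command at every 200 Hz tick using the latest plan.

S2_HZ = 10    # planning frequency
S1_HZ = 200   # control frequency

def system2_plan(observation, instruction):
    """Slow path: language/visual understanding -> latent plan."""
    return {"goal": instruction, "context": observation}

def system1_act(plan, joint_state):
    """Fast path: latent plan + proprioception -> one motor command."""
    return [0.0] * len(joint_state)  # placeholder joint targets

def control_loop(n_ticks, observation, instruction):
    joint_state = [0.0] * 7          # e.g. a 7-DoF arm
    plan, commands = None, []
    for t in range(n_ticks):
        if t % (S1_HZ // S2_HZ) == 0:      # every 20 ticks, i.e. at 10 Hz
            plan = system2_plan(observation, instruction)
        commands.append(system1_act(plan, joint_state))
    return commands

# One second at 200 Hz yields 200 commands but only 10 System 2 calls.
commands = control_loop(200, observation=None, instruction="pick up the cup")
```

The 20:1 ratio between the two loops is the whole point: a slow VLM can steer a fast controller without ever blocking it.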

Convergence 2: Continuous Action Generation

The “Action as Language” paradigm introduced by RT-2 represented actions as tokens. However, major models in 2025 adopted new approaches for continuous action spaces.

Discrete vs Continuous Action Token

Discrete Action Token approaches (RT-1, RT-2, ACT, OpenVLA, etc.) represent robot actions as discrete tokens, the same way LLMs represent text:

  • Pros: Leverages LLM’s language understanding directly, autoregressive structure transfers VLM pretraining benefits
  • Cons: Token explosion at high-frequency control (50Hz+), precision loss in dexterous manipulation tasks
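The discrete scheme boils down to binning each continuous action dimension into a fixed number of levels. A minimal illustration (the 256-bin count and the [-1, 1] range are assumptions, not any model's exact recipe):

```python
# Illustrative discrete action tokenization: each continuous action
# dimension is clipped, binned into 256 levels, and emitted as one token.

N_BINS = 256

def tokenize(action, low=-1.0, high=1.0):
    """Map continuous values in [low, high] to integer tokens in [0, 255]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                       # clip to range
        tokens.append(round((a - low) / (high - low) * (N_BINS - 1)))
    return tokens

def detokenize(tokens, low=-1.0, high=1.0):
    """Invert the binning; precision is limited to the bin width."""
    return [low + t / (N_BINS - 1) * (high - low) for t in tokens]

action = [0.5, -0.25, 0.0]            # e.g. end-effector deltas
tokens = tokenize(action)
recovered = detokenize(tokens)
# Round-trip error is bounded by half a bin width (~0.004 here):
# this is exactly the "precision loss" cost of discrete tokens.
```

At 50Hz with a 7-DoF arm this scheme already emits 350 tokens per second, which is the "token explosion" problem the continuous methods below avoid.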

Continuous Action Token (π0, GR00T N1, SmolVLA, etc.) directly generates continuous values via Flow Matching or Diffusion:

  • Pros: Precise continuous control, efficient at high frequencies, handles multimodal actions naturally
  • Cons: Requires multiple denoising steps at inference, relatively limited in leveraging LLM’s language capabilities

Continuous Action Example: Gradually generating action sequence from noise (Source: Diffusion Policy)

For detailed analysis of these trade-offs, see the FAST Tokenizer document. FAST overcomes discrete token limitations with DCT+BPE compression, achieving 5x faster training.
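To give a feel for why DCT compression works on action chunks, here is a hedged sketch of the DCT stage only (the quantization and BPE stages of FAST are omitted, and the 50-step trajectory and coefficient count are illustrative assumptions):

```python
# Sketch of the DCT stage of a FAST-style pipeline: a smooth action
# trajectory concentrates its energy in low-frequency DCT coefficients,
# so a few coefficients can stand in for many timesteps.

import math

def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def idct2(X):
    """Inverse of dct2 above (a scaled DCT-III)."""
    N = len(X)
    return [(X[0] + 2 * sum(X[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                            for k in range(1, N))) / N for n in range(N)]

# A smooth one-dimensional joint trajectory over 50 control steps.
traj = [(n / 49) ** 2 for n in range(50)]

coeffs = dct2(traj)
kept = coeffs[:8] + [0.0] * (len(coeffs) - 8)  # keep 8 of 50 coefficients
approx = idct2(kept)
# Because robot trajectories are smooth, 8 coefficients reconstruct all
# 50 steps with small error -- a >6x reduction before BPE even runs.
```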

2025 Major Models’ Choices

| Model | Action Generation | Features |
| --- | --- | --- |
| π0, π0.5 | Flow Matching | Efficient alternative to Diffusion, 50Hz control |
| GR00T N1 | Diffusion Transformer | Generates actions from noise, dual-system |
| SmolVLA | Flow Matching | 450M lightweight model, runs on a MacBook |
| LBM | Diffusion Transformer | Whole-body single-model control, 48 timesteps |

Figure: π0 architecture (PaliGemma VLM + Flow Matching Action Expert, Physical Intelligence)

Figure: GR00T N1 architecture (Eagle VLM + Diffusion Transformer, NVIDIA)

Figure: SmolVLA architecture (SmolVLM + Flow Matching, HuggingFace)

Robot joint control is inherently continuous. Representing it with discrete tokens causes precision loss and token explosion at high frequencies (50Hz+). The convergence of 2025 major models on Flow Matching/Diffusion is a natural solution to this problem.
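The core of flow-matching generation fits in a few lines. In the sketch below the velocity field is a hand-written stand-in for the learned "action expert" (a real model predicts it from observations; every name here is an assumption):

```python
# Minimal flow-matching sampler: start from Gaussian noise x at t=0 and
# Euler-integrate dx/dt = v(x, t) to t=1, where the sample lands on the
# action distribution. With a known target, the rectified-flow velocity
# field reduces to the straight-line field below.

import random

def velocity(x, t, target):
    """Stand-in for the learned velocity field v(x, t)."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]

def sample_action(target, n_steps=10, dim=3, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(dim)]   # pure noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i / n_steps
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
    return x

action = sample_action(target=[0.3, -0.1, 0.5])
# x has been transported from noise onto the target action.
```

Note the trade-off stated above in miniature: generating one action takes `n_steps` network evaluations, which is why inference cost is the method's main cost.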

Major Model Timeline (2025)

| Date | Model | Company | Key Contribution |
| --- | --- | --- | --- |
| 2025.02 | Figure Helix | Figure AI | First full-body humanoid VLA |
| 2025.03 | GR00T N1 | NVIDIA | First open-source humanoid VLA |
| 2025.03 | Gemini Robotics | Google DeepMind | Gemini 2.0-based, cross-embodiment |
| 2025.04 | π0.5 | Physical Intelligence | Open-world generalization |
| 2025.05 | SmolVLA | HuggingFace | 450M lightweight VLA, runs on a MacBook |
| 2025.08 | LBM | Boston Dynamics + TRI | Whole-body single-model control |
| 2025.11 | π*0.6 | Physical Intelligence | RL self-improvement (RECAP) |

2024: The Beginning of VLA

2024 was the year when full-fledged VLA (Vision-Language-Action) models emerged.

Key Breakthroughs

| Release | Model | Significance |
| --- | --- | --- |
| 2024.06 | OpenVLA | First large-scale open-source VLA (7B), matching 55B RT-2-X |
| 2024.10 | π0 | Flow Matching VLA, origin of the General Robot Policy |

OpenVLA - The Start of Open-Source VLA

OpenVLA is a 7B parameter open-source VLA jointly developed by Stanford, UC Berkeley, TRI, Google DeepMind, and MIT.

| Feature | Details |
| --- | --- |
| Parameters | 7B (7x smaller than RT-2-X's 55B) |
| Performance | Equal to or better than RT-2-X |
| Fine-tuning | LoRA updates only 1.4% of parameters; feasible on a consumer GPU |
| Versatility | Only model achieving a 50%+ success rate on all test tasks |

OpenVLA contributed to democratizing VLA research. It became the foundation for subsequent lightweight open-source VLA research including SmolVLA and MiniVLA.

π0 - The Origin of General Robot Policy

π0 is the first general-purpose robot foundation model released by Physical Intelligence.

| Feature | Details |
| --- | --- |
| Architecture | PaliGemma VLM + Flow Matching Action Expert |
| Control Frequency | 50Hz (Action Chunking) |
| Data | 8 robot platforms, 10,000+ hours of teleoperation |
| Performance | Substantially outperforms OpenVLA/Octo on complex dexterous tasks |

π0’s greatest contribution was proving the viability of the VLM + Flow Matching combination. Many subsequent models adopted similar architectures.


RT Series: The Foundation of VLA (2022-2023)

The starting point for all VLA research is Google DeepMind’s RT (Robotics Transformer) series.

RT-1 (2022.12) - The Start of Robotics Transformer

| Feature | Details |
| --- | --- |
| Data | 13 robots, 17 months, 130K episodes |
| Performance | 97% success rate on 700 training tasks |
| Contribution | Tokenization of robot I/O, large-scale real-world data training |

RT-2 (2023.07) - Action as Language

RT-2 from the RT Series introduced the “Action as Language” paradigm.

| Key Idea | Description |
| --- | --- |
| Action as Language | Represent robot actions as text tokens |
| VLM-based | Leverages PaLM-E (12B) and PaLI-X (55B) |
| Emergent Capabilities | Interprets new semantic commands not in the training data |

RT-2’s “Action as Language” was revolutionary, but it also revealed the limitations of discrete token approaches. This led to the 2025 convergence on continuous action generation methods.

RT-X (2023.10) - Open X-Embodiment

Google DeepMind collaborated with 33 research labs to build an open-source dataset spanning 22 robot types and 1M+ episodes. This data was subsequently used to train many models, including OpenVLA and GR00T N1.


Pioneering Research

Research that laid the foundation for robot learning before VLA.

Diffusion Policy (2023.03)

Diffusion Policy was the first work to successfully apply diffusion models, already proven in image generation, to robot action generation.

| Contribution | Description |
| --- | --- |
| Multimodal Action | Handles multiple valid actions in the same situation |
| High Stability | More stable training than prior imitation-learning methods |
| Influence | Directly influenced π0's flow matching, GR00T's DiT, and others |

ACT (2023.04)

ACT enabled efficient imitation learning through the Action Chunking concept.

| Contribution | Description |
| --- | --- |
| Action Chunking | Execute a sequence of actions as a single unit |
| Data Efficiency | 80-90% success with ~10 minutes of demonstrations (some tasks) |
| ALOHA Hardware | Bimanual dexterous manipulation system built for $20K |
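The chunking idea itself is simple to sketch. Below is a hypothetical illustration (the chunk size and the stand-in policy are assumptions): the policy predicts K future actions per call, and the executor drains that buffer before querying the policy again.

```python
# Sketch of ACT-style action chunking: instead of one action per
# observation, the policy predicts a chunk of K future actions, cutting
# the number of (expensive) policy calls by a factor of K.

CHUNK = 8  # actions predicted per policy call ("chunk size")

def policy(observation):
    """Stand-in for a trained chunking policy: returns CHUNK actions."""
    return [[0.0, 0.0, 0.1 * k] for k in range(CHUNK)]

def rollout(n_steps, observation):
    executed, policy_calls, buffer = [], 0, []
    for _ in range(n_steps):
        if not buffer:               # chunk exhausted: query policy again
            buffer = list(policy(observation))
            policy_calls += 1
        executed.append(buffer.pop(0))
    return executed, policy_calls

actions, calls = rollout(32, observation=None)
# 32 steps with a chunk of 8 requires only 4 policy calls.
```

Real implementations typically also blend overlapping chunks (temporal ensembling) for smoothness; that refinement is omitted here.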

Most demo booths I have seen at exhibitions and conferences are built on ACT. Its fast training and low computational requirements have made it the de facto standard in research and demo environments.


Future Outlook

Unsolved Problems

| Problem | Status |
| --- | --- |
| Tactile Sensing | Visual information alone has limits; research starting with Sharpa and others |
| RL-based Self-Improvement | Started with π*0.6's RECAP; still at the research stage |
| Real-world Generalization | π0.5 began open-world generalization; more validation needed |
| Synthetic Data Utilization | GR00T N1 reported a 40% improvement; room for expansion |

Directions to Watch

  1. Expansion of Loco-Manipulation: Figure Helix 02 opened the door, more models will attempt locomotion+manipulation integration
  2. Tactile/Multimodal Sensing: Essential for tasks impossible with vision alone
  3. On-Device Lightweight Models: Locally executable models like Gemini Robotics On-Device
  4. RL Integration: Models that self-improve beyond demonstration data

See Also

Key Model Documents
