The Action Data Scaling Problem

Why VLAs cannot scale as easily as LLMs

Example of action data in LeRobot Dataset format. Each joint’s state values are recorded, but such data does not exist on the internet.
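To make the caption above concrete, here is a hedged sketch of what a single frame of such action data might look like. The field names follow the LeRobot convention (`observation.state`, `action`), but the exact schema and shapes depend on the robot; a 6-DoF arm and illustrative values are assumed here.

```python
# Hedged sketch of one frame in a LeRobot-style dataset.
# Field names follow the LeRobot convention, but the numbers and the
# 6-DoF arm are assumptions for illustration only.

frame = {
    "observation.state": [0.12, -0.54, 1.03, 0.00, 0.47, -0.10],  # joint positions (rad)
    "action":            [0.15, -0.50, 1.00, 0.02, 0.45, -0.08],  # commanded joint targets
    "timestamp": 0.033,       # seconds since episode start (~30 fps)
    "episode_index": 0,
    "frame_index": 1,
}

# None of these numbers can be scraped from the web: they only exist
# because a physical robot (or its teleoperator) produced them.
print(len(frame["action"]))  # → 6
```

The key point is that every value in this record is a physical measurement, which is precisely why such data cannot be collected the way web text is.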


The Core Problem

Extending LLMs into VLAs appears feasible, and doing so could affect a massive labor market. However, fundamental barriers make it difficult for VLAs to simply follow the LLM success formula:

  • Action data doesn’t exist on the internet. LLMs could scale by leveraging the vast text data on the internet, but robot action data is not recorded anywhere online, making immediate scaling impossible.
  • Evaluation requires operating physical robots. The risks of hardware failure or environmental destruction (such as breaking dishes) are too significant, making it difficult to build automated benchmarks like those used for LLMs.
  • Other fundamental challenges remain. A basic scientific understanding of physical intelligence is still lacking, tactile sensors are hard to implement, and dexterous hardware is difficult to mass-produce, among many other open problems.

This document focuses on the action data scarcity problem and the various approaches being taken to solve it.


Differences from LLMs

Aspect | LLM | VLA
Data Source | Internet (virtually unlimited) | Real robot actions (limited)
Collection Cost | Low | High
Evaluation | Can be automated | Requires physical robot operation

Table: Comparison of data collection and evaluation between LLM and VLA — the fundamental reason why VLA scaling is difficult

LLMs were able to leverage the vast text data accumulated on the internet for training, and the quality of generated text can be automatically evaluated. In contrast, VLAs require moving actual robots to collect data, and the success of actions must be physically verified. This is the fundamental bottleneck of VLA scaling.


Various Action Data Collection Methods

To solve these problems, various companies and research groups are trying different approaches. Let’s examine the main methods below.

Teleoperation

Teleoperation is a method where humans remotely control robots while collecting action data. It’s the most direct data collection method but has limitations as it requires human labor.

1957: The Beginning of Teleoperation

1957 teleoperation system. The history of remote-controlled robots is longer than one might think.

ALOHA

ALOHA open-source bimanual teleoperation system

ALOHA is a low-cost teleoperation system developed at Stanford. It was used in the ACT (Action Chunking with Transformers) paper, and both the hardware design and software are fully open-source, making it easy for researchers to replicate. The release of this system has greatly contributed to the democratization of robot learning research.

Tesla

Tesla teleoperation data collection team

Tesla is collecting action data for its humanoid robot Optimus by paying teleoperators $48 per hour. Requirements include a height between 5’7” and 5’11” (about 170–180 cm), the ability to walk 7+ hours per day, and the ability to carry loads of up to 30 pounds (about 13.6 kg). These requirements exist because the teleoperator’s movements are directly reflected in the actual robot.

VR teleoperation demonstration

When you actually try teleoperation with VR equipment, sustaining it for extended periods is extremely difficult. The weight of the VR headset, restricted field of view, and repetitive motions while gripping controllers cause severe fatigue after just a few hours. This is one of the fundamental bottlenecks of teleoperation-based data collection.


UMI-Style Data Collection

UMI data collection system

UMI (Universal Manipulation Interface) is a system that enables manipulation data collection with a handheld gripper, no robot required. It records human manipulation directly, without teleoperation equipment, and the learned skills can be transferred to various robots.

The advantage of this approach is that data can be collected without robot hardware, greatly improving the scalability of data collection.


Simulation

NVIDIA Isaac & Cosmos

NVIDIA Isaac GR00T Synthetic Manipulation

NVIDIA Isaac GR00T Synthetic Manipulation is a Blueprint that generates synthetic data in simulation environments for robot manipulation learning. It enables mass production of training data across various scenarios without collecting real robot data.

Simulation-based approaches can significantly reduce data collection costs, but overcoming the sim-to-real gap between simulation and reality is the key challenge.
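One common mitigation for the sim-to-real gap is domain randomization: varying physics and visual parameters across simulated episodes so the learned policy cannot overfit to any single simulator configuration. The sketch below illustrates the idea only; all parameter names and ranges are invented and do not reflect NVIDIA Isaac's actual API.

```python
import random

# Hedged sketch of domain randomization, a common sim-to-real mitigation.
# Each simulated episode samples different physics/visual parameters so the
# policy cannot overfit one simulator configuration. All names and ranges
# here are illustrative assumptions, not a real simulator's API.

def sample_sim_params(rng):
    return {
        "friction":   rng.uniform(0.5, 1.5),   # surface friction multiplier
        "mass_scale": rng.uniform(0.8, 1.2),   # object mass perturbation
        "latency_ms": rng.uniform(0.0, 40.0),  # actuation delay
        "light":      rng.uniform(0.3, 1.0),   # scene brightness
    }

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(3)]
for params in episodes:
    print(params)  # each episode runs under its own randomized physics
```

A policy trained across many such randomized configurations treats the real world as just one more variation, which is the intuition behind this family of techniques.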


HuggingFace Community

HuggingFace is driving community-based data collection through its open-source ecosystem. Their success formula is as follows:

  • Open Source HW, SW: Making hardware and software designs public so anyone can participate
  • Data & Model Hub: Providing a central hub where datasets and models can be shared
  • Tutorial & Hackathon: Encouraging community participation through educational materials and hackathons

SmolVLA, a VLA trained on community-collected data, demonstrates the results of this approach.


World Model + IDM

1X World Model Self-Learning

1X is researching methods for robots to learn from unlabeled video data using World Models and IDM (Inverse Dynamics Model). This approach shows the potential to leverage large-scale video data without action labels.

While robot action data doesn’t exist on the internet, videos containing human movements are virtually unlimited. If actions can be extracted from these videos, it might be possible to break through the scaling problem.
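The pipeline described above can be sketched as follows. Everything here is a toy stand-in (the `toy_idm` function and the vector "frames" are invented for illustration): a real inverse dynamics model is a learned network, trained on a small amount of action-labeled robot data, that infers the action connecting two consecutive observations, turning unlabeled video into (observation, action) training pairs.

```python
# Hedged sketch of IDM-based pseudo-labeling of unlabeled video.
# `toy_idm` is an invented stand-in: a real IDM would be a learned network
# trained on a small amount of robot data that does have action labels.

def toy_idm(frame_t, frame_t1):
    """Infer the action that moved the state from frame_t to frame_t1.
    Here each 'frame' is just a joint-position vector, and the 'action'
    is approximated as the per-joint displacement."""
    return [b - a for a, b in zip(frame_t, frame_t1)]

def pseudo_label(video):
    """Turn an unlabeled video (a list of frames) into (obs, action) pairs."""
    return [
        (video[t], toy_idm(video[t], video[t + 1]))
        for t in range(len(video) - 1)
    ]

# Unlabeled "video": three frames of a 2-joint arm.
video = [[0.0, 0.0], [0.1, -0.1], [0.2, -0.2]]
pairs = pseudo_label(video)
print(pairs[0])  # → ([0.0, 0.0], [0.1, -0.1])
```

Once video can be pseudo-labeled this way, the effectively unlimited supply of human video on the internet becomes usable as VLA training data, which is what makes this direction a candidate answer to the scaling problem.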

For more details, see VLM Backbone Limitations and World Models.


Approaches Summary

Approach | Organization | Description
Teleoperation | Tesla, Google, Physical Intelligence, Galaxea | Direct data collection via human remote operation
Non-Teleop | UMI, Generalist, Sunday Robotics | Learning from handheld-device data collected without robots
Simulation | NVIDIA | Produce, augment, and evaluate data with physics simulation (Omniverse) and world models (Cosmos)
Community | HuggingFace | Community-based data collection in an open-source spirit
World Model | 1X, NVIDIA | Evaluation automation, VLA backbone replacement, and synthetic data generation with world models
Distributed Evaluation | Academia | OXE, RoboArena, etc.
Other | Various | Action extraction from human videos, egocentric data collection equipment, etc.

Table: Major approaches and organizations addressing the VLA scaling problem

Each approach has its own pros and cons, and it’s not yet clear which method is best. We need to keep watching the developments in this field.


Intro Guide Complete

You’ve completed the Physical AI Introduction Guide.

To explore further, return to the Physical AI Introduction Guide or read the insight essays below.

See Also