SmolVLA

Author’s Note

Proves Community Data Is Sufficient. SmolVLA achieves Pi0-level performance using only public data collected by the LeRobot community, an important demonstration that VLAs are possible without massive proprietary datasets.
Efficiency-First Design. Practical optimization techniques such as Visual Token Reduction (64 tokens per frame) and Layer Skipping (only half of the VLM layers) are impressive.
VLA Democratization. A VLA that runs on a MacBook significantly improves research accessibility: anyone can experiment.
Fully Reproducible. The complete recipe, from pretraining to fine-tuning, is publicly available, and the model can even be trained on Google Colab.

SmolVLA Overview

SmolVLA: Achieving large VLA-level performance with 450M parameters

Key Significance

Roughly 7-15x Smaller Model, Equal or Better Performance: 450M parameters achieving 87.3% on LIBERO (OpenVLA 7B: 76.5%, Pi0 3.3B: 86.0%)
Community Data Only: Trained on 481 LeRobot datasets (10.6M frames)
Runs Anywhere: Works on a MacBook, a consumer GPU, or a CPU
Asynchronous Inference: ~30% faster task completion (13.75s → 9.7s), ~2x throughput
Fully Open-Source: Model, code, and data are all released, with a reproducible recipe

Overview

SmolVLA is a lightweight VLA model released by HuggingFace on June 3, 2025. With 450M parameters, it matches or outperforms models roughly 7-15x larger (Pi0 at 3.3B, OpenVLA at 7B) while being trained only on public community data.

Item	Details
Released	June 3, 2025
Developer	HuggingFace (LeRobot team)
Parameters	450M (VLM ~350M + Action Expert ~100M)
Paper	arXiv:2506.01844
Blog	huggingface.co/blog/smolvla
Model	lerobot/smolvla_base
GitHub	huggingface/lerobot

Architecture

SmolVLA = SmolVLM2-500M + Flow Matching Action Expert (~100M)

Components

Component	Spec
Vision Encoder	SigLIP
Language Decoder	SmolLM2-360M based (the language model inside SmolVLM2-500M)
Action Expert	~100M parameters, Flow Matching
Visual Tokens	64 per frame (16x compression vs 1024 standard)

Key Efficiency Techniques

Technique	Description
Visual Token Reduction	PixelShuffle compresses each 512×512 image to 64 tokens
Layer Skipping	Uses only the first half (16 of 32) of the VLM layers → ~50% compute reduction
Interleaved Attention	Action Expert alternates Cross-Attention (to VLM features) and Self-Attention (over action tokens) layers
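The token and layer counts in the table above can be checked with a few lines of arithmetic. This is a minimal sketch, not SmolVLA code: the function names are hypothetical, and the patch size (16) and pixel-shuffle ratio (4) are assumptions chosen because they reproduce the 1024 → 64 figures quoted in the Components table.

```python
def visual_token_count(image_size: int, patch_size: int, shuffle_ratio: int) -> int:
    """Tokens left after patchifying a square image and applying pixel shuffle."""
    patches_per_side = image_size // patch_size
    # Pixel shuffle (space-to-depth) folds each r x r block of patches into one token.
    tokens_per_side = patches_per_side // shuffle_ratio
    return tokens_per_side ** 2


def kept_layers(total_layers: int, keep_fraction: float = 0.5) -> int:
    """Number of VLM layers actually evaluated when the upper layers are skipped."""
    return int(total_layers * keep_fraction)


# Assumed values: 16-pixel patches, shuffle ratio 4 (not stated in the text above).
patches = visual_token_count(512, patch_size=16, shuffle_ratio=1)
tokens = visual_token_count(512, patch_size=16, shuffle_ratio=4)
print(patches, tokens, patches // tokens)  # 1024 64 16
print(kept_layers(32))                     # 16
```

Because attention cost grows with sequence length, cutting visual tokens 16x and evaluating half the layers compound: both savings apply to every camera frame the model sees.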

Action Expert

Hidden size: 0.75x of VLM dimension
Flow Matching for continuous action generation
Action chunk: 10-50 actions (default 50)
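To make "Flow Matching for continuous action generation" concrete, the sketch below integrates a velocity field from Gaussian noise to an action chunk with Euler steps. It is an illustration under stated assumptions, not the SmolVLA implementation: `oracle_velocity` is a hypothetical stand-in for the learned Action Expert, the straight-line path and the 0 → 1 time direction are one common convention, and the 6-dimensional action and 10 integration steps are example values.

```python
import random


def oracle_velocity(x, tau, target):
    """Toy stand-in for the learned velocity field v(x, tau).

    On a straight noise-to-action path, the velocity at time tau points
    from the current sample toward the target action.
    """
    return [(t - xi) / (1.0 - tau) for xi, t in zip(x, target)]


def sample_action_chunk(target, num_steps=10, seed=0):
    """Euler-integrate from noise (tau = 0) to an action chunk (tau = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt
        v = oracle_velocity(x, tau, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


chunk_size, action_dim = 50, 6  # 50 is the default above; 6 is an assumed action size
target = [0.01 * i for i in range(chunk_size * action_dim)]  # flattened toy chunk
chunk = sample_action_chunk(target)
max_err = max(abs(a - b) for a, b in zip(chunk, target))
print(len(chunk), max_err < 1e-6)
```

In the real model the velocity is predicted by the Action Expert conditioned on VLM features, so the whole chunk of 50 actions is produced by a handful of forward passes rather than token by token.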

Training Data

Item	Details
Datasets	481 LeRobot community datasets
Episodes	22.9K
Frames	10.6M
Primary Robot	SO100 robot arm
FPS	Standardized at 30 FPS

Approximately 1 order of magnitude less data than competing VLAs

Data Preprocessing

Task Annotations: Noisy or missing task descriptions were regenerated in a consistent style with Qwen2.5-VL-3B-Instruct
Camera Standardization: OBS_IMAGE_1 (top-down), OBS_IMAGE_2 (wrist), OBS_IMAGE_3+ (additional views)
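The camera standardization rule can be sketched as a simple renaming step. The function below is hypothetical and only illustrates the mapping described above; the source camera names (`top`, `wrist`, `side`) are made-up examples, since community datasets use many different naming schemes.

```python
def standardize_camera_keys(camera_names, top_name="top", wrist_name="wrist"):
    """Map dataset-specific camera names to OBS_IMAGE_1, OBS_IMAGE_2, OBS_IMAGE_3+."""
    mapping = {}
    if top_name in camera_names:
        mapping[top_name] = "OBS_IMAGE_1"    # top-down view
    if wrist_name in camera_names:
        mapping[wrist_name] = "OBS_IMAGE_2"  # wrist view
    next_index = 3
    for name in camera_names:
        if name not in mapping:              # any additional views, in listed order
            mapping[name] = f"OBS_IMAGE_{next_index}"
            next_index += 1
    return mapping


print(standardize_camera_keys(["side", "wrist", "top"]))
# {'top': 'OBS_IMAGE_1', 'wrist': 'OBS_IMAGE_2', 'side': 'OBS_IMAGE_3'}
```

A fixed ordering like this matters because the model sees camera views by position, so the same slot should always carry the same viewpoint across datasets.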

Reproducible Recipe

SmolVLA provides a complete recipe to reproduce the entire training process from pretraining to fine-tuning. The key insight is that each model is built sequentially on top of the previous one:

1. SmolLM2 — LLM

2. SmolVLM2 — VLM

3. SmolVLA — VLA

Step 1 — SmolLM2: A lightweight language model. The foundational LLM responsible for text understanding and generation — everything in SmolVLA starts here.

Step 2 — SmolVLM2: A Vision-Language Model that combines SmolLM2 with a SigLIP vision encoder. This adds the ability to understand images and video.

Step 3 — SmolVLA: A VLA that attaches a Flow Matching Action Expert to SmolVLM2 to output robot actions. Beyond seeing and understanding, the model can now act.

The training recipe for every stage is fully open, allowing anyone to reproduce the entire pipeline from scratch.

Official Resources:

VLAb - Official SmolVLA pretraining reproduction kit

smollm - SmolLM/SmolVLM backbone training recipe

LeRobot - For fine-tuning and inference

Backbone Model

SmolVLA uses SmolVLM2-500M-Video-Instruct as the VLM backbone. The SmolVLM training recipe is available in the smollm repository.

Pretraining (VLAb)

VLAb is a SmolVLA pretraining library derived from LeRobot.

Item	Value
Policy	SmolVLA2
Base Model	SmolVLM2-500M-Video-Instruct
Steps	200,000
Multi-GPU	Accelerate + SLURM support

# VLAb pretraining example
accelerate launch --config_file accelerate_configs/multi_gpu.yaml \
  src/lerobot/scripts/train.py \
  --policy.type=smolvla2 \
  --policy.repo_id=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --dataset.repo_id="dataset_paths" \
  --steps=200000

Community Dataset

Version	Datasets	Contributors	Episodes	Frames	Size
v1	128	55	11.1K	5.1M	119.3 GB
v2	340	117	6.3K	5M	59 GB

Tabletop manipulation tasks recorded with the SO100 robot arm
v1: Quality-filtered version with curated task descriptions (used for pretraining)

# Download dataset
huggingface-cli download HuggingFaceVLA/community_dataset_v1 \
  --repo-type=dataset \
  --local-dir /path/to/community_dataset_v1

Fine-tuning (LeRobot)

After pretraining, LeRobot is recommended for fine-tuning and inference.

Item	Value
Steps	20,000 (recommended)
Batch Size	64
Duration	~4 hours (single A100)

Note: VLAb checkpoints may not be directly compatible with LeRobot due to normalization format differences. Use LeRobot’s migration script for conversion.

Performance

Simulation Benchmarks

LIBERO:

Model	Parameters	Success Rate
SmolVLA	0.45B	87.3%
Pi0	3.3B	86.0%
OpenVLA	7B	76.5%

Meta-World:

Model	Parameters	Success Rate
SmolVLA	0.45B	57.3%
Pi0	3.5B	47.9%
TinyVLA	-	31.6%

Real Robot (SO100)

Task	Success Rate
Pick-Place	75%
Stacking	90%
Sorting	70%
Average	78.3%

Comparison: Pi0 (3.5B) 61.7%, ACT 48.3%

Cross-Embodiment (SO101)

Condition	Success Rate
In-Distribution	90%
Out-of-Distribution	50%

Asynchronous Inference

SmolVLA’s differentiating feature: separating action prediction from action execution

How It Works

Early Trigger: Sends a new observation once the remaining action queue falls below 70% of a full chunk
Decoupled Threads: Inference and the control loop run separately
Chunk Fusion: Merges overlapping actions from successive chunks
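The three mechanisms above can be illustrated with a single-threaded simulation of the control loop. This is a sketch under assumptions, not LeRobot's implementation: `fake_policy` is a hypothetical stub, inference latency is modeled as a fixed number of control ticks, the chunk size is shrunk to 10 for readability, and averaging is just one possible fusion rule.

```python
from collections import deque

CHUNK_SIZE = 10    # actions per chunk (the document's default is 50)
THRESHOLD = 0.7    # early-trigger fraction from the list above
LATENCY_TICKS = 3  # assumed inference latency, measured in control ticks


def fake_policy(obs_tick):
    """Hypothetical policy stub: (timestep, action) pairs starting at obs_tick."""
    return [(obs_tick + i, float(obs_tick + i)) for i in range(CHUNK_SIZE)]


def fuse(queue, new_chunk, now):
    """Chunk fusion: drop stale actions, average where old and new overlap."""
    old = dict(queue)
    merged = {}
    for t, a in new_chunk:
        if t < now:
            continue  # this timestep passed while inference was running
        merged[t] = (old[t] + a) / 2 if t in old else a
    for t, a in old.items():
        merged.setdefault(t, a)
    return deque(sorted(merged.items()))


def run(total_ticks=40):
    queue = deque(fake_policy(0))
    pending = None  # (arrival_tick, chunk) while an inference call is in flight
    executed, idle_ticks = [], 0
    for now in range(total_ticks):
        if pending and pending[0] <= now:  # a finished chunk arrives
            queue = fuse(queue, pending[1], now)
            pending = None
        if pending is None and len(queue) / CHUNK_SIZE < THRESHOLD:
            pending = (now + LATENCY_TICKS, fake_policy(now))  # early trigger
        if queue:
            executed.append(queue.popleft())  # control loop keeps acting
        else:
            idle_ticks += 1                   # robot would stall here
    return executed, idle_ticks


executed, idle_ticks = run()
print(len(executed), idle_ticks)  # 40 0
```

With these settings the queue never empties, so the robot never waits on inference. A synchronous loop would instead stall for the full inference latency each time a chunk ran out.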

Performance

Mode	Completion Time	Completions in 60s
Synchronous	13.75s	9
Asynchronous	9.7s	19

→ ~30% faster task completion, ~2x throughput (19 vs. 9 completions in 60s)
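The headline figures follow directly from the table; a quick check of the arithmetic, using only the numbers reported above:

```python
sync_time, async_time = 13.75, 9.7  # seconds per task completion
sync_done, async_done = 9, 19       # completions within 60 seconds

speedup = (sync_time - async_time) / sync_time
throughput_ratio = async_done / sync_done
print(f"{speedup:.1%} faster, {throughput_ratio:.2f}x throughput")
# 29.5% faster, 2.11x throughput
```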

Quick Start

Installation

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"

Fine-tuning

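# Fine-tune the pretrained base model on an example SO100 stacking dataset.
# The script path depends on the LeRobot version (newer layouts place it under src/).
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=20000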

Load Model

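# Import path as of the original release; newer LeRobot versions dropped the
# `common` package level, so adjust the import if it fails.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the pretrained base checkpoint from the Hugging Face Hub and load it
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")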
