SmolVLA

HuggingFace's Lightweight Open-Source VLA Model - Achieving Large Model Performance with 450M Parameters

Author’s Note

  • Proves Community Data Is Sufficient. Achieves Pi0-level performance using only public data collected by the LeRobot community, an important demonstration that VLAs are possible without massive proprietary datasets.
  • Efficiency-First Design. Practical optimizations like visual token reduction (64 tokens per frame) and layer skipping (half the VLM layers) are impressive.
  • VLA Democratization. A VLA that runs on a MacBook significantly improves research accessibility; anyone can experiment.
  • Fully Reproducible. The complete recipe from pretraining to fine-tuning is publicly available, and the model can even be trained on Google Colab.

SmolVLA Overview

SmolVLA: Achieving large VLA-level performance with 450M parameters

Key Significance

  • 10x Smaller Model, Equal Performance: 450M parameters achieving LIBERO 87.3% (OpenVLA 7B: 76.5%, Pi0 3.3B: 86.0%)
  • Community Data Only: Trained on 481 LeRobot datasets (10.6M frames)
  • Runs Anywhere: Works on MacBook, consumer GPU, CPU
  • Asynchronous Inference: 30% faster response (13.75s → 9.7s), 2x throughput
  • Fully Open-Source: Complete release of model, code, data with reproducibility

Overview

SmolVLA is a lightweight VLA model released by HuggingFace on June 3, 2025. With 450M parameters, trained only on public community data, it matches or exceeds the performance of models 7-10x larger.

Item        Details
Released    June 3, 2025
Developer   HuggingFace (LeRobot team)
Parameters  450M (VLM ~350M + Action Expert ~100M)
Paper       arXiv:2506.01844
Blog        huggingface.co/blog/smolvla
Model       lerobot/smolvla_base
GitHub      huggingface/lerobot

Architecture

SmolVLA = SmolVLM2-500M + Flow Matching Action Expert (~100M)

Components

Component         Spec
Vision Encoder    SigLIP
Language Decoder  SmolLM2-360M based
Action Expert     ~100M parameters, Flow Matching
Visual Tokens     64 per frame (16x compression vs 1024 standard)

Key Efficiency Techniques

Technique               Description
Visual Token Reduction  PixelShuffle compresses 512×512 images → 64 tokens
Layer Skipping          Uses only half (16/32) of the VLM layers → ~50% compute reduction
Interleaved Attention   Alternating cross-attention and self-attention layers
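As a concrete illustration, the PixelShuffle-style reduction regroups the encoder's token grid so that each r×r block of tokens becomes a single wider token. A minimal sketch, assuming a 32×32 token grid with 768-dim features (the actual SmolVLA shapes may differ):

```python
import numpy as np

def pixel_shuffle_tokens(tokens: np.ndarray, grid: int = 32, ratio: int = 4) -> np.ndarray:
    """Fold each ratio x ratio block of visual tokens into one wider token."""
    n, dim = tokens.shape
    assert n == grid * grid
    x = tokens.reshape(grid, grid, dim)
    x = x.reshape(grid // ratio, ratio, grid // ratio, ratio, dim)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two block dims next to the feature dim
    return x.reshape((grid // ratio) ** 2, ratio * ratio * dim)

tokens = np.random.rand(1024, 768)   # 1024 tokens from a standard ViT grid
reduced = pixel_shuffle_tokens(tokens)
print(reduced.shape)                 # (64, 12288): 16x fewer tokens, wider features
```

No information is discarded: the 16x token reduction trades sequence length for feature width, which is much cheaper for attention layers whose cost scales with sequence length.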

Action Expert

  • Hidden size: 0.75x of VLM dimension
  • Flow Matching for continuous action generation
  • Action chunk: 10-50 actions (default 50)
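The Flow Matching sampler can be sketched as an Euler integration from Gaussian noise toward an action chunk. This is a toy illustration: `toy_field` stands in for the trained Action Expert network, and the chunk length, action dimension, and step count are assumptions.

```python
import numpy as np

def sample_action_chunk(vector_field, chunk_len=50, action_dim=6, steps=10, seed=0):
    """Integrate a velocity field v(x, t) from noise (t=0) toward actions (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((chunk_len, action_dim))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * vector_field(x, t)  # Euler step: x <- x + dt * v(x, t)
    return x

def toy_field(x, t):
    # toy stand-in for the learned Action Expert: flows every sample toward zero
    return -x

actions = sample_action_chunk(toy_field)
print(actions.shape)  # (50, 6): one chunk of 50 six-dim actions
```

Because the whole chunk is generated in one integration, a single forward pass yields up to 50 actions, which is what makes the asynchronous inference scheme below worthwhile.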

Training Data

Item           Details
Datasets       481 LeRobot community datasets
Episodes       22.9K
Frames         10.6M
Primary Robot  SO100 robot arm
FPS            Standardized at 30 FPS

Approximately one order of magnitude less data than competing VLAs

Data Preprocessing

  • Task Annotations: Standardized with Qwen2.5-VL-3B-Instruct
  • Camera Standardization: OBS_IMAGE_1 (top-down), OBS_IMAGE_2 (wrist), OBS_IMAGE_3+ (additional views)
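A hypothetical remapping of heterogeneous community-dataset camera keys onto these standardized names. All key names in this sketch are illustrative, not the actual LeRobot schema:

```python
def standardize_cameras(frame: dict, view_order: list) -> dict:
    """Rename dataset-specific camera keys to OBS_IMAGE_1..N.

    view_order lists the dataset's own keys ordered as: top-down view,
    wrist view, then any additional views."""
    out = dict(frame)
    for i, key in enumerate(view_order):
        out[f"OBS_IMAGE_{i + 1}"] = out.pop(key)
    return out

frame = {"cam_top": "top.png", "cam_wrist": "wrist.png", "state": [0.0, 0.1]}
print(standardize_cameras(frame, ["cam_top", "cam_wrist"]))
```

This kind of normalization is what lets 481 independently collected datasets be mixed into one pretraining corpus.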

Reproducible Recipe

SmolVLA provides a complete recipe to reproduce the entire training process from pretraining to fine-tuning. The key insight is that each model is built sequentially on top of the previous one:

SmolLM2

1. SmolLM2 — LLM

SmolVLM2

2. SmolVLM2 — VLM

SmolVLA

3. SmolVLA — VLA

Step 1 — SmolLM2: A lightweight language model. The foundational LLM responsible for text understanding and generation — everything in SmolVLA starts here.

Step 2 — SmolVLM2: A Vision-Language Model that combines SmolLM2 with a SigLIP vision encoder. This adds the ability to understand images and video.

Step 3 — SmolVLA: A VLA that attaches a Flow Matching Action Expert to SmolVLM2 to output robot actions. Beyond seeing and understanding, the model can now act.

The training recipe for every stage is fully open, allowing anyone to reproduce the entire pipeline from scratch.

Official Resources:

  • VLAb - Official SmolVLA pretraining reproduction kit
  • smollm - SmolLM/SmolVLM backbone training recipe
  • LeRobot - For fine-tuning and inference

Backbone Model

SmolVLA uses SmolVLM2-500M-Video-Instruct as the VLM backbone. The SmolVLM training recipe is available in the smollm repository.

Pretraining (VLAb)

VLAb is a SmolVLA pretraining library derived from LeRobot.

Item        Value
Policy      SmolVLA2
Base Model  SmolVLM2-500M-Video-Instruct
Steps       200,000
Multi-GPU   Accelerate + SLURM support
# VLAb pretraining example
accelerate launch --config_file accelerate_configs/multi_gpu.yaml \
  src/lerobot/scripts/train.py \
  --policy.type=smolvla2 \
  --policy.repo_id=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --dataset.repo_id="dataset_paths" \
  --steps=200000

Community Dataset

Version  Datasets  Contributors  Episodes  Frames  Size
v1       128       55            11.1K     5.1M    119.3 GB
v2       340       117           6.3K      5M      59 GB
  • SO-100 robot arm based tabletop manipulation tasks
  • v1: Quality filtered + task description curated version (used for pretraining)
# Download dataset
huggingface-cli download HuggingFaceVLA/community_dataset_v1 \
  --repo-type=dataset \
  --local-dir /path/to/community_dataset_v1

Fine-tuning (LeRobot)

After pretraining, LeRobot is recommended for fine-tuning and inference.

Item        Value
Steps       20,000 (recommended)
Batch Size  64
Duration    ~4 hours (single A100)

Note: VLAb checkpoints may not be directly compatible with LeRobot due to normalization format differences. Use LeRobot’s migration script for conversion.


Performance

Simulation Benchmarks

LIBERO:

Model    Parameters  Success Rate
SmolVLA  0.45B       87.3%
Pi0      3.3B        86.0%
OpenVLA  7B          76.5%

Meta-World:

Model    Parameters  Success Rate
SmolVLA  0.45B       57.3%
Pi0      3.5B        47.9%
TinyVLA  -           31.6%

Real Robot (SO100)

Task        Success Rate
Pick-Place  75%
Stacking    90%
Sorting     70%
Average     78.3%

Comparison: Pi0 (3.5B) 61.7%, ACT 48.3%

Cross-Embodiment (SO101)

Condition            Success Rate
In-Distribution      90%
Out-of-Distribution  50%

Asynchronous Inference

SmolVLA’s differentiating feature: separating action prediction from action execution

How It Works

  1. Early Trigger: Sends new observation when action queue < 70%
  2. Decoupled Threads: Inference and control loop run separately
  3. Chunk Fusion: Merges overlapping actions from successive chunks
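The three mechanisms can be condensed into a single-threaded toy sketch. In the real system inference runs on a separate thread, and the 50/50 averaging used for chunk fusion here is an assumption for illustration:

```python
from collections import deque

CHUNK = 50        # actions per predicted chunk
THRESHOLD = 0.7   # request a new chunk when the queue falls below 70% full

def control_tick(queue, predict_chunk):
    """One control-loop tick: refill early, fuse overlap, execute one action."""
    if len(queue) < THRESHOLD * CHUNK:            # 1. early trigger
        new = predict_chunk()                      # 2. would run on its own thread
        fused = [0.5 * a + 0.5 * b for a, b in zip(queue, new)]  # 3. chunk fusion
        queue = deque(fused + list(new[len(fused):]))
    return queue.popleft(), queue

q = deque(float(i) for i in range(10))             # queue nearly drained
action, q = control_tick(q, lambda: [float(i) for i in range(CHUNK)])
print(action, len(q))  # executes one fused action; queue refilled to 49
```

Because the queue is refilled before it empties, the robot never stalls waiting for the model, which is where the throughput gain in the next table comes from.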

Performance

Mode          Completion Time  Completions in 60s
Synchronous   13.75s           9
Asynchronous  9.7s             19

→ 30% faster response, 2x throughput


Quick Start

Installation

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"

Fine-tuning

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=20000

Load Model

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
