GR00T N1.5

Humanoid Foundation Model with 2x Language Instruction Compliance via Frozen VLM and FLARE Loss

Author’s Note

  • A version focused on improving language-instruction following toward a general-purpose robot foundation model (RFM). The frozen VLM preserves language understanding while combining with robot control, doubling compliance (46.6% → 93.3%).
  • Demonstrates that learning from human video is possible: with FLARE, the model learns from action-label-free human videos, a path to reducing dependence on expensive teleoperation data.
  • Synthetic data’s power remains evident: DreamGen data at just 9.1% of the pretraining mix enabled learning 22 new verbs, the key to expanding behavioral diversity beyond pick-and-place.

Key Significance

  • 2x Language Instruction Compliance: 46.6% → 93.3% (+46.7%p)
  • Frozen VLM Technique: Freezes VLM to preserve language understanding capability
  • FLARE Loss Introduction: Additional training objective through implicit world modeling
  • Human Video Learning Capability: FLARE enables learning from human videos during post-training
  • Novel Object Manipulation: Zero-shot manipulation of novel objects possible after post-training (15% success rate)

Important: The GR00T-N1.5-3B released on HuggingFace is a pretrained model. Human video learning, Unitree G1 experiments, etc. are post-training experiment results demonstrating the capability that FLARE provides, and are not included in the released model weights.


Overview

| Item | Details |
| --- | --- |
| Announced | May 20, 2025 (Computex 2025, Taiwan) |
| Type | Vision-Language-Action (VLA) |
| Parameters | 3B |
| VLM | Eagle 2.5 (Frozen) |
| DiT | 16 layers |
| Key Technology | Frozen VLM + FLARE Loss |
| GitHub | NVIDIA/Isaac-GR00T |
| Hugging Face | nvidia/GR00T-N1.5-3B |

Key Improvements over N1

1. Frozen VLM (Vision-Language Model)

Core architecture change in N1.5.

| Aspect | N1 | N1.5 |
| --- | --- | --- |
| VLM Training | Trainable | Frozen |
| VLM Model | Eagle2-1B | Eagle 2.5 |
| Grounding IoU | 35.5 | 40.4 (GR-1) |

Key Features:

  • VLM remains frozen during both pretraining and finetuning
  • Preserves language understanding capability and improves generalization
  • Enhanced physical understanding and grounding with NVIDIA Eagle 2.5
  • Simplified adapter MLP + Layer Normalization on visual/text token embeddings
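The adapter described above can be sketched as follows. This is a minimal illustration, not the released implementation: the class name, hidden sizes, and activation are assumptions; only the structure (LayerNorm followed by an MLP over frozen VLM token embeddings) follows the text.

```python
import torch
import torch.nn as nn

class VLMAdapter(nn.Module):
    """Illustrative adapter: LayerNorm + MLP projecting frozen VLM token
    embeddings into the action head's hidden size (dimensions are assumed)."""
    def __init__(self, vlm_dim: int = 1536, dit_dim: int = 1024):
        super().__init__()
        self.norm = nn.LayerNorm(vlm_dim)
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, dit_dim),
            nn.GELU(),
            nn.Linear(dit_dim, dit_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq, vlm_dim), produced by the frozen VLM
        # (in training, the VLM backbone would have requires_grad_(False))
        return self.proj(self.norm(vlm_tokens))
```

Because the VLM stays frozen, only this adapter and the downstream policy receive gradients, which is what preserves the pretrained language understanding.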

2. FLARE Loss (Future LAtent REpresentation Alignment)

New training objective added in N1.5.

Concept:

  • Adds FLARE Loss to existing Flow Matching Loss from N1
  • Instead of generatively modeling future frames, aligns with latent representations of future states
  • Policy network internally reasons about future latent states while maintaining action prediction capability

How it works:

  1. Add learnable “future tokens” to standard VLA model
  2. Train these tokens to align with future robot observation embeddings
  3. Calculate Future Latent Alignment Loss using cosine similarity
  4. FLARE Loss coefficient: 0.2 (both pretraining and post-training)
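Steps 3–4 above can be written as a small loss function. This is a sketch under assumptions (tensor shapes and function name are hypothetical); it shows the reported recipe of a cosine-similarity alignment term weighted by 0.2 and added to the flow-matching loss.

```python
import torch
import torch.nn.functional as F

def flare_objective(flow_matching_loss: torch.Tensor,
                    future_token_feats: torch.Tensor,
                    future_obs_embeds: torch.Tensor,
                    coeff: float = 0.2) -> torch.Tensor:
    """Combine the flow-matching loss with a cosine-similarity alignment
    loss between future-token features and (frozen) embeddings of the
    actually observed future. coeff=0.2 matches the reported coefficient."""
    align = 1.0 - F.cosine_similarity(
        future_token_feats, future_obs_embeds.detach(), dim=-1).mean()
    return flow_matching_loss + coeff * align
```

Detaching the future observation embeddings reflects that the target encoder is frozen: only the policy's future tokens are pulled toward the targets, not vice versa.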

Key Benefits:

  • Direct learning from human egocentric videos possible
  • Meaningful learning from human videos alone without robot demonstrations
  • Significantly improved novel object manipulation capability

Related Paper: FLARE: Robot Learning with Implicit World Modeling (arXiv:2505.15659)


Architecture

Overall Architecture

| Component | Description |
| --- | --- |
| Vision Encoder | SigLIP 2-based Vision Transformer (224×224 RGB input) |
| Language Encoder | T5-based Transformer |
| Proprioception Encoder | MLP indexed by embodiment ID |
| Action Decoder | Flow Matching Transformer (DiT-based) |
| Model Size | 3B parameters |
| Tensor Type | BF16 |
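The "MLP indexed by embodiment ID" component can be illustrated as below. This is a minimal sketch, not NVIDIA's implementation: the class name, state dimensions, and hidden sizes are assumptions; the point is that each embodiment gets its own small MLP, selected by ID at forward time.

```python
import torch
import torch.nn as nn

class ProprioEncoder(nn.Module):
    """Sketch of an embodiment-indexed proprioception encoder: one MLP per
    embodiment ID, chosen at forward time (all dimensions are assumed)."""
    def __init__(self, state_dims: dict, hidden: int = 256, out: int = 1024):
        super().__init__()
        # ModuleDict keys must be strings, so IDs are stringified
        self.mlps = nn.ModuleDict({
            str(eid): nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, out))
            for eid, dim in state_dims.items()
        })

    def forward(self, state: torch.Tensor, embodiment_id: int) -> torch.Tensor:
        return self.mlps[str(embodiment_id)](state)
```

This design lets one checkpoint serve robots with different state-vector sizes (e.g. a humanoid vs. an arm) while sharing the rest of the network.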

N1 vs N1.5 Architecture Comparison

| Item | GR00T N1 | GR00T N1.5 |
| --- | --- | --- |
| VLM State | Trainable | Frozen |
| VLM Model | Eagle2-1B | Eagle 2.5 |
| Adapter MLP | Complex | Simplified + LayerNorm |
| Training Objective | Flow Matching | Flow Matching + FLARE |
| World Modeling | None | Implicit world modeling integrated |
| Model Parameters | 2.2B | 3B |

Benchmarks

Language Instruction Compliance (Real GR-1 Humanoid)

Task: Given two fruits, pick up the one specified by the language command and place it on a plate:

| Model | Language Instruction Compliance |
| --- | --- |
| GR00T N1 | 46.6% |
| GR00T N1.5 | 93.3% |

Improvement: +46.7%p (approximately 2x)

Simulation Benchmarks

| Benchmark | GR00T N1 | GR00T N1.5 | Improvement |
| --- | --- | --- | --- |
| Language Table (sim) | 52.8% | 93.2% | +40.4%p |
| Sim GR-1 Language | 36.4% | 54.4% | +18.0%p |
| RoboCasa (30 demos) | 17.4% | 47.5% | +30.1%p |
| DreamGen Tasks (12) | 13.1% | 38.3% | +25.2%p |

Real Robot Benchmarks (GR-1 Humanoid)

| Task | GR00T N1 | GR00T N1.5 |
| --- | --- | --- |
| Language Instruction Compliance | 46.6% | 93.3% |
| Novel Object Manipulation (0-shot) | 0% | 15.0% |

FLARE Standalone Performance

With 100 trajectories per task on real GR-1 manipulation tasks, FLARE achieves a 95.1% average success rate.

Human Video Learning Effect

| Condition | Success Rate |
| --- | --- |
| 1-shot (robot demo only) | 37.5% |
| 1-shot + human egocentric video | 60.0% |
| 10-shot + human egocentric video | 80.0% |

Training

Pretraining

Pretraining data composition for the GR00T-N1.5-3B model released on HuggingFace.

[Figure] GR00T N1.5 Pretraining Data Distribution (Source: NVIDIA Research)

Pretraining Data Composition

| Data Source | Proportion | Description |
| --- | --- | --- |
| Real GR-1 | 27.3% | Real robot data collected internally by NVIDIA |
| OpenXE | 27.3% | Open X-Embodiment open-source data |
| Sim GR-1 (DexMG) | 27.3% | Simulation synthetic data |
| DreamGen | 9.1% | Neural trajectory synthetic data |
| AgiBot-Beta | 9.1% | AgiBot collaboration data |

Note: The pretraining data does not include human video data. Human video learning is a capability that FLARE provides and is utilized during post-training.

Training Infrastructure

| Item | Details |
| --- | --- |
| GPU | 1,000× H100 |
| Training Steps | 250K |
| Batch Size | 16,384 |
| Optimizer | AdamW |
| Learning Rate Schedule | Cosine (warmup ratio 0.05) |
| FLARE Loss Coefficient | 0.2 |
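The reported optimizer recipe (AdamW, cosine schedule, 0.05 warmup ratio over 250K steps) can be sketched as below. The peak learning rate is not reported and is an assumption here, as is the function name.

```python
import math
import torch

def make_optimizer(params, total_steps: int = 250_000,
                   warmup_ratio: float = 0.05, peak_lr: float = 1e-4):
    """Sketch of the reported recipe: AdamW with linear warmup for the
    first 5% of steps, then cosine decay to zero. peak_lr is assumed."""
    opt = torch.optim.AdamW(params, lr=peak_lr)
    warmup = int(total_steps * warmup_ratio)  # 12,500 steps

    def lr_lambda(step: int) -> float:
        if step < warmup:
            return step / max(1, warmup)          # linear warmup
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```

Calling `sched.step()` once per optimizer step walks the learning rate through warmup and decay.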

FLARE (Future LAtent REpresentation Alignment)

The core training objective added in N1.5. FLARE was published as a separate paper (arXiv:2505.15659).

Core Concepts

FLARE is a lightweight approach that aligns with latent representations of future states instead of generating future frames pixel-by-pixel.

Future Tokens Mechanism:

  1. Add learnable “future tokens” embeddings to standard VLA models
  2. Extract intermediate representations corresponding to the M future tokens at an internal layer L of the Diffusion Transformer
  3. Project via MLP and align with frozen Vision-Language embeddings

Total Training Objective:

L = L_fm + λL_align (λ = 0.2)
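The future-token mechanism can be sketched as follows. This is an illustration under assumptions (the class name, token count M, and dimensions are hypothetical): learnable tokens are appended to the DiT input sequence, and an MLP projects their layer-L hidden states into the frozen vision-language embedding space, where the alignment term of the objective above is computed.

```python
import torch
import torch.nn as nn

class FutureTokens(nn.Module):
    """Sketch of FLARE's future-token pathway (M, dit_dim, vl_dim assumed):
    learnable tokens ride through the DiT, and an MLP maps their hidden
    states into the frozen VL embedding space for alignment."""
    def __init__(self, num_tokens: int = 4, dit_dim: int = 1024,
                 vl_dim: int = 1536):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dit_dim) * 0.02)
        self.proj = nn.Sequential(nn.Linear(dit_dim, dit_dim), nn.GELU(),
                                  nn.Linear(dit_dim, vl_dim))

    def append(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, T, dit_dim) -> (batch, T + M, dit_dim)
        return torch.cat(
            [seq, self.tokens.expand(seq.shape[0], -1, -1)], dim=1)

    def project(self, layer_l_hidden: torch.Tensor) -> torch.Tensor:
        # hidden states of the M future tokens, taken at internal layer L
        return self.proj(layer_l_hidden)
```

Because the targets live in the frozen VL space, no future-frame generation (and no future VL embedding computation at deployment) is required.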

Benefits of FLARE

  • Lightweight Implementation: Minimal architectural changes—just adding a few tokens to standard VLA models
  • Inference Efficiency: No future Vision-Language embedding computation needed at deployment
  • Human Video Learning Capability: Enables learning from action-label-free human videos during post-training
  • Up to 26% Performance Improvement: Over baselines on multitask simulation benchmarks

Post-training Experiment Results

The following are experiment results to validate the capability that FLARE provides. These results are not included in the released pretrained model.

Unitree G1 Post-training

Results of post-training N1 and N1.5 with 1,000 teleoperation episodes:

| Metric | GR00T N1 | GR00T N1.5 |
| --- | --- | --- |
| Known Fruit Manipulation Success | 44.0% | 98.8% |
| Generalization to 5 Novel Objects | - | 84.2% |

Human Video Learning (FLARE Paper Experiments)

FLARE’s key contribution is enabling learning from human egocentric videos without action labels.

Asymmetric Loss Application:

  • Robot demonstration data (with actions): Flow Matching Loss + Future Alignment Loss
  • Human videos (without actions): Future Alignment Loss only
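The asymmetric loss can be sketched for a mixed batch as below. This is an assumption-laden illustration (per-sample loss tensors and the exact per-source weighting are not specified in the source): samples with action labels contribute both terms, while action-free human-video samples contribute only the alignment term.

```python
import torch

def asymmetric_loss(flow_matching_per_sample: torch.Tensor,
                    alignment_per_sample: torch.Tensor,
                    has_actions: torch.Tensor,
                    lam: float = 0.2) -> torch.Tensor:
    """Sketch of the asymmetric objective over a mixed batch: the flow
    matching term is masked out for action-free (human video) samples,
    while every sample contributes the alignment term."""
    mask = has_actions.float()
    # flow matching averaged over action-labeled samples only
    fm = (flow_matching_per_sample * mask).sum() / mask.sum().clamp(min=1.0)
    align = alignment_per_sample.mean()
    return fm + lam * align
```

In a batch of robot demos and human clips, this lets gradient flow from human video reach the policy purely through the future-alignment pathway.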

Data Collection:

  • Head-mounted GoPro cameras for egocentric demonstration collection
  • Approximately 150 human egocentric demonstrations per object

Left: Human egocentric demonstration captured with GoPro / Right: GR-1 robot demonstration (Source: NVIDIA Research)

Experiment Results (pick-and-place on 5 objects with novel shapes):

| Condition | Success Rate | Improvement |
| --- | --- | --- |
| 1-shot (robot demo only) | 37.5% | - |
| 1-shot + human egocentric video | 60.0% | +22.5%p |
| 10-shot + human egocentric video | 80.0% | +42.5%p |

These experiments were conducted in the FLARE paper, and it is not specified whether human video learning is included in the N1.5 pretrained model released on HuggingFace.


N1 vs N1.5 Comprehensive Comparison

| Aspect | GR00T N1 | GR00T N1.5 |
| --- | --- | --- |
| Announced | March 2025 (GTC) | May 2025 (Computex) |
| Model Size | 2.2B | 3B |
| VLM | Trainable | Frozen |
| VLM Model | Eagle2-1B | Eagle 2.5 |
| Training Objective | Flow Matching | Flow Matching + FLARE |
| Human Video Learning | Not possible | Possible |
| Language Instruction Compliance | 46.6% | 93.3% |
| Language Table | 52.8% | 93.2% |
| RoboCasa (30 demos) | 17.4% | 47.5% |
| DreamGen Tasks | 13.1% | 38.3% |

References

GR00T N1.5

FLARE

GR00T-Dreams

Base Model

News


See Also

GR00T Series

  • Eagle - N1.5’s VLM (Eagle 2.5)
  • DreamGen - GR00T-Dreams Synthetic Data Pipeline
  • Jim Fan - NVIDIA GEAR Lab, GR00T Research Lead