GR00T N1

World's First Open-Source Humanoid Robot Foundation Model

Author’s Note

  • Systematic validation of synthetic data. Training on simulation trajectories and neural trajectories together yields a 40% performance improvement over real data alone, demonstrating that synthetic data can serve as a core learning resource.
  • A breakthrough for the action-data scaling problem. LAPA enables learning from human videos without action labels, and the simulation pipeline generates 780K trajectories in just 11 hours, opening new possibilities for data scaling.
  • Adoption of a System 1/2 hierarchical structure. The VLM (System 2) handles high-level understanding at 10Hz while the DiT (System 1) handles low-level motor control at 120Hz, a practical design that addresses both real-time control constraints and language-action mapping.

Key Significance

  • World’s First Open Humanoid Foundation Model: First open VLA for humanoid robots
  • Dual-System Architecture: System 2 (VLM) + System 1 (DiT) structure inspired by human cognition
  • Demonstrated Power of Synthetic Data: Generated 780K trajectories in 11 hours, 40% performance improvement over real data only
  • Cross-Embodiment Support: Single model supports various robot platforms
  • Fully Open Source: Model, code, and evaluation scenarios all released under Apache 2.0 license

GR00T N1 Architecture

GR00T N1 Architecture: System 2 (VLM) + System 1 (Diffusion Transformer) Dual-System Structure


Overview

| Item | Details |
|---|---|
| Announced | March 18, 2025 (GTC 2025) |
| Type | Vision-Language-Action (VLA) |
| Total Parameters | 2.2B |
| VLM Parameters | 1.34B |
| Paper | arXiv:2503.14734 |
| GitHub | NVIDIA/Isaac-GR00T |
| Hugging Face | nvidia/GR00T-N1-2B |
| License | Apache 2.0 |

Architecture

GR00T N1 adopts a Dual-System architecture inspired by human cognitive processing (Kahneman, 2011).

System 2: Vision-Language Model (Slow Thinking)

Responsible for environment interpretation and task goal understanding.

| Component | Details |
|---|---|
| Base Model | Eagle2-1B VLM |
| LLM Backbone | Qwen2.5-0.5B-Instruct |
| Image Encoder | SigLIP-2 Vision Transformer |
| Image Resolution | 224x224 |
| Image Tokens | 64 (after pixel shuffle) |
| Execution Frequency | 10Hz |

Processing Pipeline:

  1. Process RGB camera frames through SigLIP-2
  2. Process text commands through T5 encoder
  3. Combine image and text to generate environment and task goal tokens

System 1: Diffusion Transformer (Fast Thinking)

Responsible for real-time motor action generation.

| Component | Details |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Layers | 16 |
| Training Method | Action Flow-Matching |
| Conditioning | Adaptive LayerNorm (AdaLN) |
| Execution Frequency | 120Hz |

Processing Pipeline:

  1. Receive VLM output tokens and robot proprioceptive state
  2. Integrate information through Cross-Attention
  3. Support various robot platforms with embodiment-specific encoder/decoder
  4. Generate smooth and precise motor commands through denoising
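The flow-matching objective named in the table above can be sketched in a few lines. This is a minimal numpy illustration only: the DiT is replaced by a zero placeholder, and the dimensions (a 16-step chunk of 7-DoF actions) are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy flow-matching step: the network learns a velocity field that
# transports Gaussian noise onto a ground-truth action chunk.
action_chunk = rng.normal(size=(16, 7))      # 16 action steps, 7-DoF (illustrative)
noise = rng.normal(size=action_chunk.shape)  # sample from the prior
t = rng.uniform()                            # random interpolation time in [0, 1]

x_t = (1.0 - t) * noise + t * action_chunk   # point on the linear noise-to-action path
target_velocity = action_chunk - noise       # constant velocity of that path

# A real model would predict v_theta(x_t, t, vlm_tokens, proprio);
# here a zero placeholder stands in for the DiT output.
predicted_velocity = np.zeros_like(target_velocity)
loss = float(np.mean((predicted_velocity - target_velocity) ** 2))
```

At inference, the same velocity field is integrated from noise toward an action chunk over a few denoising steps, which is what makes the 120Hz control loop feasible.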

Inference Performance

| Item | Value |
|---|---|
| Inference Time | 63.9ms (one 16-action chunk) |
| Inference GPU | NVIDIA L40 (bf16) |
| Memory Requirement | ~10-12 GB |
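These numbers imply comfortable real-time headroom; a quick sanity check using only the figures above:

```python
# Check that one inference call fits inside the horizon its action chunk covers.
inference_ms = 63.9          # measured on an L40 in bf16
chunk_len = 16               # actions per chunk
control_hz = 120.0           # System 1 execution frequency

chunk_duration_ms = chunk_len / control_hz * 1000.0  # ~133.3 ms of motion per chunk
headroom_ms = chunk_duration_ms - inference_ms       # ~69 ms to spare
```

Since a 16-action chunk covers ~133 ms of motion at 120Hz, the 63.9 ms inference call can finish before the current chunk runs out, enabling continuous control.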

Training

GR00T N1 addresses the “data island” problem by organizing heterogeneous data sources into a unified pyramid structure.

Data Pyramid

GR00T N1 Data Pyramid

GR00T N1 Data Pyramid: Hierarchical structure of real data, synthetic data, and web-scale data

Data Composition by Layer

| Layer | Data Type | Scale | Role |
|---|---|---|---|
| Top | Real robot teleoperation | ~88 hours (GR00T humanoid) | Embodiment-specific grounding |
| Middle | Synthetic data | 780K simulation trajectories + ~827 hours neural trajectories | Bridge data quantity and embodiment specificity |
| Base | Web-scale videos | Ego4D, EPIC-KITCHENS, Assembly-101, HOI4D, etc. | Broad visual/behavioral priors |

Additional Data Sources

  • Open X-Embodiment: Open datasets from various robot platforms
  • AgiBot-Alpha: 140,000 trajectories

Latent Action Pre-training (LAPA)

LAPA is a core methodology for leveraging action-less videos (human videos, web videos) in training.

Latent Action Definition

Latent Actions are compressed motion information extracted from consecutive video frames, representing motion without explicit robot action labels.

VQ-VAE Architecture

Current frame (x_t) ─┐
                     ├─→ [Encoder] ─→ Latent Action ─→ [Decoder] ─→ Future frame reconstruction
Future frame (x_t+H) ┘
| Component | Function |
|---|---|
| Encoder | Extracts latent action embeddings from current/future frame pairs |
| Codebook | Quantized latent action space (shared motion vocabulary) |
| Decoder | Reconstructs the future frame from latent action + current frame |

Training and Inference Process

  1. VQ-VAE Training: Simultaneous training on all heterogeneous data (robot + human videos)
  2. Codebook Generation: Create unified latent action space across embodiments
  3. At Inference: Use continuous pre-quantized embeddings as action labels
  4. Policy Training: Treat LAPA as a separate “embodiment” for training
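The codebook lookup at the heart of step 2 can be sketched as a nearest-neighbor search. The codebook size and embedding dimension below are illustrative, and a random vector stands in for the encoder output; note that, per step 3, policy training actually consumes the continuous pre-quantized embedding rather than the discrete index.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.asarray(rng.normal(size=(32, 8)))  # 32 latent actions, dim 8 (illustrative sizes)

def quantize(z):
    """Nearest-neighbor lookup: snap a continuous encoder output to a codebook entry."""
    distances = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(distances))
    return idx, codebook[idx]

z = rng.normal(size=8)   # stands in for the encoder output of a frame pair
idx, z_q = quantize(z)
```

Because every embodiment's videos are quantized against the same codebook, human and robot motion end up expressed in one shared vocabulary.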

Cross-Embodiment Unification

LAPA’s key innovation is the ability to process human videos and robot data in the same action space. The codebook shows consistent semantics (e.g., “move right arm left”) across 8 different embodiments, including humans.

LAPA vs IDM Performance Comparison

| Data Amount | LAPA | IDM | Notes |
|---|---|---|---|
| 30 demos | Superior | Inferior | LAPA advantage in low-data regime |
| 100 demos | Equal | Equal | - |
| 300 demos | Inferior | Superior | IDM advantage as data increases |

IDM (Inverse Dynamics Model) improves alignment with real actions as data increases


Synthetic Data Generation

GR00T N1 uses two types of synthetic data: Simulation Trajectories and Neural Trajectories.

Simulation Trajectories (GR00T-Mimic / DexMimicGen)

Synthetic data generation using NVIDIA Isaac GR00T Blueprint workflow:

| Item | Value |
|---|---|
| Generated Trajectories | 780,000 |
| Generation Time | 11 hours |
| Equivalent Human Demonstration Time | 6,500 hours (~9 months continuous work) |
| Task Types | 54 unique receptacle category combinations |
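A quick check of what the table's numbers imply about per-trajectory effort, using only the figures above:

```python
trajectories = 780_000
human_equivalent_hours = 6_500
generation_hours = 11

# Each trajectory replaces ~30 s of human demonstration time
seconds_per_demo = human_equivalent_hours * 3600 / trajectories

# The pipeline is ~590x faster than collecting the same data by teleoperation
speedup = human_equivalent_hours / generation_hours
```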

Generation Workflow:

  1. Human Demonstration Collection: Teleoperation via Leap Motion device
  2. Subtask Segmentation: Segment demonstrations into object-centric subtasks
  3. Automatic Transformation and Replay: Automatic transformation in simulation environment
  4. Environment Adaptation: Environment adaptation through object position alignment
  5. Quality Filtering: Retain only successful executions

Key Features:

  • Built on RoboCasa simulation framework
  • Randomized object/receptacle placement with distractors
  • Only physically valid trajectories generated (simulator guaranteed)
  • Ground-truth action data available

Key Tools:

  • GR00T-Mimic: Generate large synthetic trajectories from few human demonstrations
  • NVIDIA Cosmos Transfer: Photorealistic lighting, color, texture augmentation
  • Isaac Lab: Robot policy training through imitation learning

Neural Trajectories

Synthetic data using video generation models:

| Item | Value |
|---|---|
| Total Generation Time | ~827 hours (10x augmentation of real data) |
| Generated Trajectories | ~300,000 |
| GPU Hours Required | 105,000 L40 GPU-hours (~1.5 days on 3,600 GPUs) |

Generation Process:

  1. Video Model Fine-tuning: Fine-tune image-to-video model on real robot data
  2. Scenario Generation: Generate diverse counterfactual scenarios with novel language prompts
  3. Object Detection: Detect objects in initial frames using commercial multimodal LLM
  4. Prompt Combination: Generate “pick {object} from {location A} to {location B}” combinations
  5. Post-processing Filtering: Filtering through LLM judgment
  6. Re-captioning: Re-generate captions for filtered videos
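Step 4 is essentially a Cartesian product over detected objects and candidate locations. A sketch with hypothetical object and location lists (in the real pipeline, the objects come from the multimodal LLM's detections in step 3):

```python
from itertools import product

# Hypothetical detections and locations for illustration
objects = ["apple", "mug", "box"]
locations = ["tray", "shelf", "counter"]

prompts = [
    f"pick {obj} from {src} to {dst}"
    for obj, (src, dst) in product(objects, product(locations, repeat=2))
    if src != dst  # skip no-op moves
]
```

Three objects and three locations already yield 18 distinct counterfactual prompts, which is why this step scales scenario diversity so cheaply.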

Neural vs Synthetic Trajectory Comparison

| Aspect | Neural Trajectories | Simulation Trajectories |
|---|---|---|
| Source | Video generation models fine-tuned on real data | Physics simulator with automatic transformation |
| Diversity | Extremely diverse (rare events like liquid pouring possible) | Limited by simulator physics constraints |
| Scalability | 2 minutes per second of video | 780K trajectories in 11 hours |
| Physical Accuracy | May violate physics; requires post-filtering | Physical validity guaranteed in simulation |
| Action Labels | Latent actions or IDM-inferred pseudo-actions | Ground-truth action data available |
| Counterfactual Generation | Easily generates new scenarios with prompts | Requires explicit environment manipulation |

Training Data Composition

Scale by Data Source

| Data Source | Scale | Type |
|---|---|---|
| GR00T Humanoid Real Data | ~88 hours | Real robot |
| Simulation Trajectories | 780,000 (equivalent to 6,500 hours) | Synthetic |
| Neural Trajectories | ~300,000 (~827 hours) | Synthetic |
| AgiBot-Alpha | 140,000 trajectories | Real robot |
| Open X-Embodiment | Various robot platforms | Real robot |
| Human Videos | Ego4D, EPIC-KITCHENS, Assembly-101, HOI4D, etc. | Web-scale |

Performance Contribution Analysis

Neural Trajectory Addition Effect (Post-training):

| Benchmark | 30 demos | 100 demos | 300 demos |
|---|---|---|---|
| RoboCasa | +4.2% | +8.8% | +6.8% |

Real Environment (GR-1 Humanoid):

  • 8-task average: +5.8% improvement

Synthetic Data vs Real Data Only:

  • Overall performance improvement: +40% (synthetic+real data vs real data only)

Key Insights

  1. Synthetic data consistently shows positive transfer effects
  2. Neural trajectories are particularly effective for rare scenarios and diverse manipulation tasks
  3. Simulation trajectories are effective for generating large amounts of physically valid data
  4. The two types of synthetic data work complementarily

Cross-Embodiment Learning

Multi-Embodiment Architecture

Separate MLPs are used per embodiment to project states/actions to a shared embedding dimension.

Supported Embodiment Types:

  • Single-arm manipulator (Franka Emika Panda)
  • Bimanual system with parallel-jaw grippers
  • Bimanual system with dexterous hands
  • Full-body control humanoid robot (GR-1)
  • Latent action embodiment (LAPA) - for video data

Unified Training Strategy

Co-training Approach:

  1. Batch Sampling: Sample training batches from heterogeneous data mixture
  2. Shared Backbone: End-to-end optimization with shared vision-language backbone
  3. Embodiment-specific Decoders: Embodiment-specific decoders for action output dimensions
  4. Dual System Training: Simultaneous training of System 1 (DiT) and System 2 (VLM)
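Steps 1 and 3 above can be sketched structurally: batches are drawn from a heterogeneous mixture, and a per-embodiment head maps shared features to each platform's action dimensionality. The registry names and dimensions below are hypothetical, and stubs stand in for the real networks.

```python
import random

# Hypothetical embodiment registry: action dimensionality per platform
EMBODIMENTS = {
    "franka": 7,             # single-arm manipulator
    "bimanual_gripper": 14,  # bimanual parallel-jaw system
    "gr1_humanoid": 44,      # full-body humanoid
    "lapa": 8,               # latent-action "embodiment" used for video data
}

def sample_batch(registry, batch_size=8, seed=0):
    """Step 1: sample a training batch across the heterogeneous data mixture."""
    rng = random.Random(seed)
    return [rng.choice(list(registry)) for _ in range(batch_size)]

def action_decoder(embodiment, shared_features):
    """Step 3: route shared backbone features to an embodiment-specific head (stub)."""
    return [0.0] * EMBODIMENTS[embodiment]  # placeholder for the real MLP decoder

batch = sample_batch(EMBODIMENTS)
outputs = {e: action_decoder(e, shared_features=None) for e in set(batch)}
```

Keeping only the input/output projections embodiment-specific is what lets the shared backbone absorb gradients from every platform, including the LAPA pseudo-embodiment, in a single optimization run.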

Cross-Embodiment Generalization

The latent action codebook creates a shared motion vocabulary between humans and robots. Retrieved latent embeddings show consistent semantics across 8 different embodiments (including human and robotic forms).


Training Infrastructure

| Item | Details |
|---|---|
| GPU | Up to 1,024x H100 |
| GR00T-N1-2B Pre-training | 50,000 H100 GPU-hours |
| Training Steps | 250K steps |
| Batch Size | 16,384 |
| Framework | Isaac Lab + Omniverse |
| Distributed Training | Custom library built on Ray (fault-tolerant multi-node training) |
| Orchestration | NVIDIA OSMO platform |

Benchmarks

Simulation Benchmarks (3 Suites)

| Method | RoboCasa | DexMG | GR-1 Tabletop | Average |
|---|---|---|---|---|
| BC Transformer | 26.3% | 53.9% | 16.1% | 26.4% |
| Diffusion Policy | 25.6% | 56.1% | 32.7% | 33.4% |
| GR00T N1 2B | 32.1% | 66.5% | 50.0% | 45.0% |

LIBERO Benchmark

| Task | N1 Success Rate |
|---|---|
| LIBERO-Object | 96.7% |
| LIBERO-Spatial | 92.5% |
| LIBERO-Goal | 85.0% |
| LIBERO-Long | 78.3% |

Real-World Tests (GR-1 Humanoid, 10% Training Data)

| Task Type | Diffusion Policy | GR00T N1 2B | Improvement |
|---|---|---|---|
| Pick-and-Place | 3.0% | 35.0% | +1067% |
| Articulated | 14.3% | 62.0% | +333% |
| Industrial | 6.7% | 31.0% | +363% |
| Coordination | 27.5% | 50.0% | +82% |
| Average | 10.2% | 42.6% | +318% |
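The improvement column is the relative gain over the Diffusion Policy baseline; a quick check of the table's figures (per-row rounding in the source table may differ by a point):

```python
results = {  # (Diffusion Policy, GR00T N1 2B) success rates from the table above
    "Pick-and-Place": (3.0, 35.0),
    "Articulated": (14.3, 62.0),
    "Industrial": (6.7, 31.0),
    "Coordination": (27.5, 50.0),
    "Average": (10.2, 42.6),
}

def relative_improvement(baseline, ours):
    """Percentage gain of `ours` over `baseline`, rounded to whole percent."""
    return round((ours - baseline) / baseline * 100)

avg_gain = relative_improvement(*results["Average"])  # +318%
```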

Language Instruction Compliance

| Item | Value |
|---|---|
| Language Instruction Compliance | 46.6% |

Significantly improved to 93.3% in N1.5


Capabilities

Performable Tasks

  • Object grasping
  • Moving objects with single/bimanual arms
  • Object transfer between arms (bimanual handover)
  • Multi-step tasks requiring long context
  • Combination of general skills

Key Features

| Feature | Description |
|---|---|
| Natural Language Understanding | Understands and executes language instructions |
| Motion Imitation | Learns by observing human behavior |
| Generalization | Easily generalizes to common tasks |
| Cross-Embodiment | Supports various robot platforms |

Tested Robots (Per Paper)

Robot platforms actually tested in the GR00T N1 paper:

Real Robots

| Platform | Type | Task Type |
|---|---|---|
| Fourier GR-1 | Humanoid | Language-conditioned bimanual manipulation (primary real-world testing) |

Simulation Benchmarks

| Platform | Benchmark | Task Type |
|---|---|---|
| Franka Emika Panda | RoboCasa | Tabletop manipulation (24 atomic tasks) |
| Bimanual Panda Arms | DexMimicGen | Bimanual manipulation (parallel-jaw grippers) |
| Bimanual Panda Arms + Dexterous Hands | DexMimicGen | Bimanual + dexterous manipulation |
| GR-1 Humanoid | GR-1 Tabletop | Humanoid tabletop manipulation |

Note: Unitree G1, Agibot Genie-1, etc. were added in N1.5 and N1.6


Early Access Partners

Per NVIDIA official announcement:

| Company | Robot/Platform | Note |
|---|---|---|
| 1X Technologies | NEO | Autonomous home tidying demo at GTC 2025 keynote |
| Agility Robotics | Digit | |
| Boston Dynamics | Atlas | |
| Mentee Robotics | MenteeBot | |
| NEURA Robotics | 4NE-1 | |

Authors

Core Authors (partial):

  • Dieter Fox - NVIDIA, University of Washington
  • Jan Kautz - NVIDIA VP of Learning and Perception Research
  • Ajay Mandlekar - NVIDIA Research
  • Soroush Nasiriany - NVIDIA Research
  • and 41 others

Installation

System Requirements

For Fine-tuning:

  • OS: Ubuntu 20.04 / 22.04
  • GPU: H100, L40, RTX 4090, A6000
  • Python: 3.10
  • CUDA: 12.4

For Inference:

  • OS: Ubuntu 20.04 / 22.04
  • GPU: RTX 3090, RTX 4090, A6000

Installation Method

# Create Conda environment
conda create -n gr00t python=3.10
conda activate gr00t

# Install dependencies
pip install --upgrade setuptools
pip install -e .[base]
pip install --no-build-isolation flash-attn==2.7.1.post4

See Also

GR00T Series

  • Eagle - N1’s VLM
  • DreamGen - Synthetic Data Generation Pipeline
  • Jim Fan - NVIDIA GEAR Lab, GR00T Research Lead