GR00T N1.6

Humanoid Foundation Model with 2x Larger DiT and Cosmos VLM

Author’s Note

  • Textbook scale-up. DiT doubled (16→32 layers), VLM upgraded from Eagle to Cosmos. Shows that bigger models produce smoother, more accurate motions.
  • Relative Action Space introduced. Predicts relative actions from current state instead of absolute coordinates. More robust to position changes with less jittery movements.
  • Full loco-manipulation support. Added whole-body motion data with Unitree G1 walking while manipulating. Directly addresses the core humanoid use case.

Key Significance

  • 2x DiT Expansion: Diffusion Transformer scale expansion from 16 to 32 layers
  • Cosmos VLM Introduction: VLM changed from Eagle to Cosmos-Reason-2B, enhanced reasoning capability
  • Relative Action Space: Improved generalization and adaptability with relative action space
  • Sim-to-Real Performance Improvement: Improved zero-shot simulation-to-real-world transfer
  • Loco-manipulation Support: Supports whole-body motion combining locomotion and manipulation

Overview

| Item | Details |
| --- | --- |
| Announced | September 29, 2025 (CoRL 2025, Seoul) |
| Type | Vision-Language-Action (VLA) |
| Parameters | 3B |
| VLM | Cosmos-Reason-2B |
| DiT | 32 layers (2x compared to N1.5) |
| Action Space | Relative Action Space |
| GitHub | NVIDIA/Isaac-GR00T |
| Hugging Face | nvidia/GR00T-N1.6-3B |

Key Improvements over N1.5

1. DiT Layer Expansion (16 → 32)

| Aspect | N1.5 | N1.6 |
| --- | --- | --- |
| DiT Layers | 16 | 32 (2x) |
| Effect | - | Smoother, less jittery movements; easier adaptation to changing positions |

The larger 32-layer Diffusion Transformer, combined with state-relative action prediction, generates more flexible and adaptive motions.
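Since NVIDIA has not published the DiT's hidden width, the back-of-the-envelope sketch below uses a hypothetical `d_model` purely to illustrate how doubling depth scales the stack's weight count:

```python
# Rough transformer weight count: each layer holds ~12 * d_model^2 weights
# (attention Q/K/V/O projections: 4*d^2; MLP with 4x expansion: 8*d^2).
# d_model is a placeholder; N1.6's actual DiT width is not published.

def dit_params(n_layers: int, d_model: int) -> int:
    """Approximate weight count of an n_layers-deep DiT stack."""
    return n_layers * 12 * d_model ** 2

d = 1024  # hypothetical hidden size, for illustration only
n15 = dit_params(16, d)  # 16-layer stack (N1.5)
n16 = dit_params(32, d)  # 32-layer stack (N1.6)
print(n16 / n15)  # depth doubling doubles the stack's weights: 2.0
```

With width held fixed, depth doubling is a clean 2x in stack parameters; any width change on top of that would compound quadratically.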

2. Cosmos VLM (2B) Introduction

N1.6 adopts NVIDIA's Cosmos-Reason-2B as its base VLM, replacing the Eagle VLM used in N1.5.

| Aspect | N1.5 | N1.6 |
| --- | --- | --- |
| VLM | Eagle 2.5 (1B) | Cosmos-Reason-2B |
| Parameters | ~1B | 2B (2x) |
| VLM Training | Fully frozen | Top 4 layers unfrozen |
| Adapter | 4-layer transformer | Removed |

Cosmos-Reason Key Features:

  • Flexible resolution support: Can encode images at native aspect ratio without padding
  • Deep thinking capability: Serves as the robot’s “deep thinking brain”
  • Ambiguous instruction interpretation: Converts ambiguous instructions into step-by-step plans using prior knowledge, common sense, and physics
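The flexible-resolution point can be made concrete with a small sketch. The patch size (14) and camera resolution below are assumptions for illustration, not Cosmos's published tokenizer settings; the point is that tiling at native aspect ratio avoids spending tokens on square padding:

```python
import math

def patch_grid(h: int, w: int, patch: int = 14):
    """ViT-style patch grid at native aspect ratio: round each side up
    to a whole number of patches, with no square padding."""
    return math.ceil(h / patch), math.ceil(w / patch)

# A hypothetical 480x848 wide-angle robot camera frame:
rows, cols = patch_grid(480, 848)       # native-aspect encoding
sq_rows, sq_cols = patch_grid(848, 848) # same frame padded to a square
print(rows * cols, sq_rows * sq_cols)   # native tiling needs far fewer tokens
```

For this example the native grid is 35x61 = 2,135 patches versus 61x61 = 3,721 for the padded square, and none of the native patches are wasted on padding pixels.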

3. Relative Action Space

N1.6 predicts state-relative action chunks instead of absolute joint angles or EEF positions.

| Aspect | N1/N1.5 | N1.6 |
| --- | --- | --- |
| Action Space | Absolute | Relative |
| Motion Characteristics | Fixed-position based | Relative to current state |

Advantages:

  • Smoother and more accurate motion generation
  • Easier adaptation to changing positions
  • Less jittery movements

Caveats:

  • Error accumulation may occur on small datasets, affecting correction capability
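A minimal sketch of the distinction, assuming the simplest decoding scheme (offsets added back to the measured state at execution time); NVIDIA has not published N1.6's exact parameterization, so the helpers and numbers below are illustrative only:

```python
def to_relative(chunk, current_state):
    """Re-express an absolute action chunk as offsets from the current state."""
    return [[a - s for a, s in zip(step, current_state)] for step in chunk]

def to_absolute(rel_chunk, current_state):
    """Decode relative actions back into joint targets at execution time."""
    return [[d + s for d, s in zip(step, current_state)] for step in rel_chunk]

state = [0.10, 0.50]  # current joint positions (illustrative, 2 DoF)
abs_chunk = [[0.12, 0.52], [0.14, 0.55], [0.16, 0.58]]  # 3-step chunk

rel = to_relative(abs_chunk, state)  # [[0.02, 0.02], [0.04, 0.05], ...]

# If the robot starts slightly off from where the demo was recorded,
# the relative chunk still anchors the motion to the actual state,
# whereas replaying abs_chunk would snap back to the recorded position:
shifted = [state[0] + 0.05, state[1]]
replay = to_absolute(rel, shifted)  # first target ~= [0.17, 0.52]
```

This also shows where the caveat comes from: every decoded target inherits the current state estimate, so small per-chunk errors can compound over long horizons if the training data is too sparse to learn corrective behavior.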

Architecture

Figure: GR00T N1.6 Model Architecture (Source: NVIDIA Research)

Key Architecture Changes (N1.5 → N1.6)

| Component | N1.5 | N1.6 |
| --- | --- | --- |
| Base VLM | Eagle 2.5 (frozen) | Cosmos-Reason-2B (top 4 layers unfrozen) |
| DiT Size | 16 layers | 32 layers |
| VLM Post-processing Adapter | 4-layer transformer adapter | Removed |

Benchmarks

Evaluation Environments

N1.6 is evaluated across various simulation and real robot environments:

| Evaluation | Description |
| --- | --- |
| LIBERO | Evaluation after 20-40k-step post-training on the LIBERO dataset |
| SimplerEnv | Evaluation after finetuning on the Google Robot fractal dataset |
| BEHAVIOR1k | Post-training checkpoints provided |
| IsaacLabEvalTasks | Industrial manipulation tasks (Nut Pouring, Exhaust Pipe Sorting) |

Real Robot Demonstrations

Tasks demonstrated on NVIDIA Research page:

  • T-shirt folding
  • Object insertion
  • Bimanual handoff
  • Loco-manipulation with Unitree G1

Performance Characteristics

According to NVIDIA Research page:

  • N1.6 converges faster than N1.5, generating smoother actions
  • Requires more careful tuning to prevent overfitting
  • 5-6% inter-experiment variance observed

Note: Specific benchmark numbers for N1.6 have not yet been published on the official research page. See N1 and N1.5 documents for their performance comparisons.


Training

Pretraining

| Item | N1.6 |
| --- | --- |
| Pretraining Steps | 300K |
| Global Batch Size | 16,384 |
| Post-training Steps | 10K-30K (batch size ≤1K) |

Pretraining Data Distribution

Figure: GR00T N1.6 Pretraining Data Distribution (Source: NVIDIA Research)

Compared to N1.5, N1.6 adds thousands of hours of new teleoperation data to the pretraining mix.

Main Data Sources

| Data Source | Platform Type | Description |
| --- | --- | --- |
| Bimanual YAM Arms | Bimanual manipulator | Precise bimanual manipulation task data |
| AGIBot Genie-1 | Semi-humanoid | Various manipulation task data |
| Simulated Galaxea R1 Pro | Humanoid | Synthetic data based on the BEHAVIOR suite |
| Unitree G1 | Humanoid | Whole-body loco-manipulation data |

Finetuned Checkpoints

N1.6 provides finetuned checkpoints for various tasks/environments.

| Checkpoint | Robot | Task |
| --- | --- | --- |
| GR00T-N1.6-bridge | WidowX | Manipulation |
| GR00T-N1.6-fractal | Google Robot | Manipulation |
| GR00T-N1.6-BEHAVIOR1k | Galaxea R1 Pro | Loco-manipulation |
| GR00T-N1.6-G1-PnPAppleToPlate | Unitree G1 | Loco-manipulation (Pick & Place) |

Full list and usage: GitHub - Isaac-GR00T README


Post-training Notes

  • 5-6% performance variance observed even with identical settings, seed, and dropout
  • N1.6 converges faster than N1.5, increasing overfitting risk
  • Careful hyperparameter tuning required
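Given the reported 5-6% run-to-run variance, a single evaluation run can easily flip the ranking between two checkpoints. A standard mitigation (common practice, not something NVIDIA prescribes) is to average success rates over several evaluation runs before selecting a checkpoint; the numbers below are hypothetical:

```python
import statistics

def mean_success(rates):
    """Average success rate across repeated evaluation runs."""
    return statistics.mean(rates)

# Hypothetical eval results for two post-training checkpoints. Comparing
# only the first run (0.78 vs 0.80) would pick the wrong checkpoint;
# averaging over three runs absorbs the ~5-6% variance.
ckpt_20k = [0.78, 0.84, 0.83]
ckpt_30k = [0.80, 0.76, 0.75]

best = max([("20k", ckpt_20k), ("30k", ckpt_30k)],
           key=lambda kv: mean_success(kv[1]))
print(best[0])  # -> 20k
```

The same reasoning argues for evaluating several checkpoints along the training run rather than only the last one, since faster convergence raises the risk of picking an overfit endpoint.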

Supported Robots

Robot platforms validated on N1.6:

| Platform | Type | Documentation |
| --- | --- | --- |
| Bimanual YAM robot | Bimanual manipulator | - |
| AGIBot Genie-1 | Semi-humanoid | AGIBot |
| Unitree G1 | Humanoid | Unitree Humanoid |
| Fourier GR-1 | Humanoid | - |

Version Comparison Summary

| Feature | N1 | N1.5 | N1.6 |
| --- | --- | --- | --- |
| Announced | 2025.03 GTC | 2025.05 Computex | 2025.09 CoRL |
| Model Size | 2.2B | 3B | 3B |
| Base VLM | Eagle2-1B (trainable) | Eagle 2.5 (frozen) | Cosmos-Reason-2B (top 4 unfrozen) |
| DiT Layers | 16 | 16 | 32 |
| Action Space | Absolute | Absolute | Relative |
| Pretraining Steps | 250K | 250K | 300K |
| Key Feature | Basic VLA, synthetic data | FLARE, language understanding | Scale-up, loco-manipulation |

Reference: NVIDIA Sim-to-Real Workflow

Note: This section describes NVIDIA’s robotics workflow that can be used with N1.6, not features of the N1.6 model itself.

Details: Building Generalist Humanoid Capabilities with GR00T N1.6 (NVIDIA Developer Blog)

The sim-to-real workflow introduced in NVIDIA’s developer blog includes three components:

| Component | Description |
| --- | --- |
| Whole-Body RL | Dynamically stable motion primitives trained via RL in Isaac Lab/Sim |
| COMPASS Navigation | Synthetic-data-trained navigation achieving zero-shot sim-to-real transfer |
| Vision-Based Localization | CUDA-accelerated libraries: cuVSLAM, cuVGL, FoundationStereo, nvblox |


See Also

GR00T Series

  • Cosmos - N1.6’s VLM (Cosmos-Reason-2B)
  • Eagle - N1, N1.5’s VLM
  • DreamGen - Synthetic Data Generation Pipeline
  • Jim Fan - NVIDIA GEAR Lab, GR00T Research Lead