GR00T-Dreams (DreamGen)

NVIDIA's synthetic data generation pipeline for robot learning: Neural Trajectory generation with World Foundation Models

Key Significance

  • Solving the Robot Data Problem: Generate large-scale synthetic trajectory data using World Foundation Models from just a single image and language instructions
  • Dramatic Efficiency: Train GR00T N1.5 in just 36 hours (vs. 3 months with manual collection)
  • Combined with GR00T-Mimic: Generate 780K synthetic trajectories in 11 hours (equivalent to 6,500 hours of human demonstration)
  • Behavioral Generalization: Perform 22 new behaviors across 10 novel environments from single-environment pick-and-place data
  • Contact-Rich Task Support: Learn challenging tasks like towel folding, hammering, and bowl stacking that simulation struggles with
  • Open Source: Released under Apache 2.0 license

Overview

GR00T-Dreams is NVIDIA GEAR Lab’s initiative to solve the robotics data problem. The core technology, DreamGen, leverages Video World Models (Cosmos-Predict2) to generate synthetic robot data called Neural Trajectories. It enables robots to “dream” new scenarios and learn from them.

Attribute              Details
Release                May 2025 (arXiv), Computex 2025
Research Institutions  NVIDIA, UW, KAIST, UCLA, UCSD, Caltech, NTU, UMD, UT Austin
Core Technology        Video World Model + Inverse Dynamics Model
Base Model             Cosmos-Predict2
Paper                  arXiv:2505.12705
GitHub                 NVIDIA/GR00T-Dreams
License                Apache 2.0

DreamGen: 4-Stage Pipeline

DreamGen is a simple yet highly effective 4-stage pipeline.

Stage 1: Video World Model Fine-tuning

Fine-tune existing Image-to-Video generation models (Cosmos-Predict2) for the target robot embodiment.

Attribute           Details
Base Model          Cosmos-Predict2
Fine-tuning Method  LoRA (Low-Rank Adaptation)
Purpose             Learn robot dynamics while preserving internet video knowledge

Why LoRA:

  • Prevents catastrophic forgetting of pre-trained internet video knowledge
  • Efficient parameter updates
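
The LoRA idea above can be sketched in a few lines: the pretrained weight matrix stays frozen, and only a small low-rank pair of matrices is trained on top of it. This is a minimal NumPy illustration of the mechanism, not the actual Cosmos-Predict2 fine-tuning code; the layer size, rank, and `alpha` are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 512, 512, 8
alpha = 16  # LoRA scaling factor (hypothetical value)

# Frozen pretrained weight: keeps the internet-video knowledge intact.
W_pretrained = rng.standard_normal((d_out, d_in))

# Trainable low-rank adapters: B starts at zero so training begins
# exactly at the pretrained behavior (no catastrophic jump).
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def lora_forward(x):
    # Effective weight is W + (alpha/rank) * B @ A; W itself is never updated.
    return W_pretrained @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Before any adapter updates, the output matches the frozen model.
assert np.allclose(lora_forward(x), W_pretrained @ x)

full_params = W_pretrained.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full fine-tune: {full_params}")
# at rank 8, roughly 3% of the parameters of a full fine-tune
```

Because gradients flow only into A and B, the pretrained weights cannot drift, which is exactly why LoRA avoids catastrophic forgetting.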

Stage 2: Synthetic Video Generation (Dream Generation)

Generate synthetic robot videos by prompting the fine-tuned model with initial frames and language instructions.

Input: Initial image + Language instruction ("Pick up the cup and place it on the shelf")
      |
[Fine-tuned Cosmos-Predict2]
      |
Output: Photorealistic robot video (including novel behaviors/environments)

Key Features:

  • Generate novel behaviors not seen during training
  • Generate same behaviors across diverse environments
  • Physically plausible motion generation
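
The stage-2 data flow can be sketched as a function over an initial frame plus an instruction. Note that `generate_dream` and its signature are a hypothetical stand-in, not the real Cosmos-Predict2 API; a placeholder array marks the output so the shapes and conditioning are visible.

```python
import numpy as np

def generate_dream(initial_frame, instruction, num_frames=49, seed=0):
    """Hypothetical stand-in for a fine-tuned image-to-video world model.

    A real pipeline would pass `initial_frame` and `instruction` to
    Cosmos-Predict2; here we return a placeholder clip of the right shape
    (frames, height, width, channels) so the data flow is visible.
    """
    rng = np.random.default_rng(seed)
    h, w, c = initial_frame.shape
    clip = rng.random((num_frames, h, w, c)).astype(np.float32)
    clip[0] = initial_frame  # generation is conditioned on the first frame
    return clip

frame = np.zeros((240, 320, 3), dtype=np.float32)
dream = generate_dream(frame, "Pick up the cup and place it on the shelf")
print(dream.shape)  # (49, 240, 320, 3)
```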

Stage 3: Action Extraction (Neural Trajectory Generation)

Since generated videos lack action annotations, extract pseudo-actions.

Method                        Description                             Use Case
Inverse Dynamics Model (IDM)  Predict actions between two frames      Explicit action extraction
Latent Action Model           Action representation in latent space   Implicit action representation

IDM Architecture:

  • Diffusion Transformer + SigLIP-2 Vision Encoder
  • Trained with Flow Matching objective
  • Two image frames -> Action chunk prediction
  • No language or proprioception input (learns pure dynamics only)

Result: Neural Trajectories

  • Combination of synthetic video + pseudo-actions
  • Trainable format without real teleoperation data
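
The core of the IDM step, recovering the action that connects two observed frames, can be shown on a toy linear world. DreamGen's real IDM is a Diffusion Transformer over image pairs; this least-squares sketch only illustrates the inverse-dynamics idea on low-dimensional states.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "world": the state advances as s' = s + 0.1 * a. An inverse dynamics
# model must learn to invert this, i.e. recover a from the pair (s, s').
dt = 0.1
def step(s, a):
    return s + dt * a

# Collect transitions, analogous to IDM training pairs sampled from videos.
S = rng.standard_normal((1000, 4))
A_true = rng.standard_normal((1000, 4))
S_next = step(S, A_true)

# Linear "IDM": predict the action from the concatenated state pair.
X = np.hstack([S, S_next])
W, *_ = np.linalg.lstsq(X, A_true, rcond=None)

A_pred = X @ W
print("max error:", np.abs(A_pred - A_true).max())  # ~0 on this linear world
```

Note the model sees only the two states, mirroring how DreamGen's IDM takes two image frames and no language or proprioception input.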

Stage 4: Policy Training

Train visuomotor policies using Neural Trajectories.

Attribute        Details
Training Target  GR00T N1.x Foundation Model
Data             Neural Trajectories (synthetic) + Real Trajectories (optional)
Effect           Acquire behavioral and environmental generalization capabilities

GR00T-Dreams Blueprint: 5-Stage Workflow

NVIDIA’s official Blueprint extends to 5 stages:

1. Post-training
   |-- Fine-tune Cosmos-Predict2 with limited teleoperation trajectories

2. Dream Generation
   |-- Generate diverse task scenarios with image + text prompts

3. Reasoning & Filtering (Cosmos-Reason1)
   |-- Evaluate and filter low-quality synthetic data

4. Neural Trajectory Extraction (IDM)
   |-- Convert 2D videos to 3D action sequences

5. Policy Training
   |-- Train visuomotor policies on synthetic dataset
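
Stage 3 of the Blueprint is the one new step relative to the four-stage pipeline: dreams are scored and low-quality ones are discarded before action extraction. A minimal sketch of that gate; `score_dream` stands in for Cosmos-Reason1, and the threshold value is hypothetical.

```python
def score_dream(dream):
    # In the real pipeline a reasoning VLM (Cosmos-Reason1) judges
    # instruction-following and physical plausibility. Here the score
    # is simply a precomputed field on the record.
    return dream["quality"]

def filter_dreams(dreams, threshold=0.5):
    """Keep only dreams whose quality score clears the threshold."""
    return [d for d in dreams if score_dream(d) >= threshold]

dreams = [
    {"id": 0, "quality": 0.9},  # plausible, on-task
    {"id": 1, "quality": 0.2},  # e.g. an object teleports mid-video
    {"id": 2, "quality": 0.7},
]
kept = filter_dreams(dreams)
print([d["id"] for d in kept])  # [0, 2]
```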

GR00T-Mimic: Trajectory Augmentation

A complementary Blueprint to GR00T-Dreams.

Overview

Attribute  Details
Purpose    Generate large-scale, physically accurate trajectories from a few human demonstrations
Method     Simulation-based trajectory augmentation (MimicGen, DexMimicGen)
Platform   NVIDIA Isaac Lab + Omniverse

How It Works

  1. Demonstration Collection: Teleoperate a simulated robot via Apple Vision Pro or SpaceMouse
  2. Keypoint Annotation: Mark key points in demonstrations
  3. Interpolation & Augmentation: Automatically generate physically accurate new trajectories
  4. Automatic Validation: Validate in Isaac Sim and convert to training data
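
Step 3 above, generating new trajectories from annotated keypoints, can be illustrated with straight-line retargeting. MimicGen's actual transforms are pose-aware and validated in simulation; this NumPy sketch only shows the keypoint-interpolation idea.

```python
import numpy as np

def interpolate_segment(start, end, steps):
    """Linearly interpolate end-effector positions between two keypoints."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - t) * start + t * end

# One demonstrated trajectory annotated with keypoints
# (e.g. above-object, grasp, above-goal), as xyz positions.
keypoints = np.array([[0.0, 0.0, 0.3],
                      [0.2, 0.1, 0.05],
                      [0.4, 0.3, 0.3]])

def retarget(keypoints, new_goal, steps_per_segment=20):
    """Generate a new trajectory by moving the final keypoint to a new goal."""
    kps = keypoints.copy()
    kps[-1] = new_goal
    segs = [interpolate_segment(kps[i], kps[i + 1], steps_per_segment)
            for i in range(len(kps) - 1)]
    return np.vstack(segs)

traj = retarget(keypoints, new_goal=np.array([0.5, -0.2, 0.3]))
print(traj.shape)  # (40, 3)
```

In the real Blueprint, each augmented trajectory is then replayed in Isaac Sim, and only physically valid rollouts become training data.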

GR00T-Dreams vs GR00T-Mimic

Aspect      GR00T-Dreams                               GR00T-Mimic
Purpose     Novel behavior/environment generalization  Deepen existing skill proficiency
Method      Video World Model                          Simulation augmentation
Data Type   Neural Trajectories                        Synthetic Trajectories
Strength    Contact-rich, novel behaviors              Physical accuracy, large scale
Core Tools  Cosmos-Predict2                            Isaac Lab, MimicGen

Complementary Nature

  • GR00T-Mimic: Develop Specialist proficiency for specific skills
  • GR00T-Dreams: Enable Generalist capabilities for new behaviors

Cosmos Transfer: Photorealistic Rendering

Bridges the Sim-to-Real gap in simulation data.

Role

Function                      Description
Style Transfer                Simulation footage -> photorealistic conversion
Lighting/Environment Changes  Apply diverse lighting, textures, and environments
Structure Preservation        Maintain the physical dynamics of robot motion

Supported Input Modalities

  • Segmentation video
  • Depth video
  • Edge video
  • Blur video
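
To make the "edge video" modality concrete: it is a per-frame edge map of the source footage, used as structural conditioning. The finite-difference sketch below is only illustrative; a real pipeline would use a proper edge detector (e.g. Canny), and the threshold is an arbitrary assumption.

```python
import numpy as np

def edge_video(frames, threshold=0.1):
    """Crude per-frame edge maps via horizontal/vertical finite differences.

    `frames` has shape (time, height, width); the output is a binary
    video of the same shape marking intensity discontinuities.
    """
    gx = np.abs(np.diff(frames, axis=2, prepend=frames[:, :, :1]))
    gy = np.abs(np.diff(frames, axis=1, prepend=frames[:, :1, :]))
    return ((gx + gy) > threshold).astype(np.float32)

frames = np.zeros((8, 64, 64), dtype=np.float32)
frames[:, 16:48, 16:48] = 1.0  # a bright square: edges along its border
edges = edge_video(frames)
print(edges.shape, int(edges[0].sum()))
```

Because only the structure (edges, depth, segmentation) is passed through, Cosmos Transfer can restyle appearance while the underlying robot motion stays fixed.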

Effect

According to NVIDIA researchers, using Cosmos-Transfer1:

  • “Adds more scene details and complex shading, natural illumination”
  • Preserves the physical dynamics of robot motion

Data Generation Efficiency

GR00T-Dreams (DreamGen)

Metric                      Value
GR00T N1.5 Training Time    36 hours
Manual Collection Estimate  ~3 months
Efficiency Improvement      ~60x

GR00T-Mimic

Metric                          Value
Trajectories Generated          780,000
Generation Time                 11 hours
Human Demonstration Equivalent  6,500 hours (~9 months of continuous work)
Trajectories per Hour           ~70,900/hour
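
The throughput figures above are internally consistent, which a two-line check confirms:

```python
trajectories = 780_000
gen_hours = 11
human_equiv_hours = 6_500

rate = trajectories / gen_hours
print(f"{rate:,.0f} trajectories/hour")         # 70,909 -> matches ~70,900
print(f"{human_equiv_hours / gen_hours:.0f}x")  # ~591x faster than teleoperation
```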

Performance Improvement

Metric                          Result
Synthetic + Real Data Combined  40% improvement in GR00T N1 performance

Supported Robot Embodiments

GR00T-Dreams supports various robot platforms:

Robot               Type        Description
Fourier GR1         Humanoid    Full-body humanoid robot
Franka Emika Panda  Single Arm  Standard research manipulator
SO-100              Single Arm  $100 low-cost robot arm
Unitree G1          Humanoid    First real-world training data included
RoboCasa            Simulation  Home-environment simulation

Extensibility:

  • Custom embodiment support available (requires metadata + data config files)
  • Multi-camera view support (e.g., wrist cameras)
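
As an illustration of what such a metadata/data config might contain, here is a hypothetical example. The field names below are illustrative only, not the repository's actual schema; consult the GR00T-Dreams repo for the real file format.

```json
{
  "embodiment": "my_arm_6dof",
  "action_dim": 7,
  "cameras": ["front", "wrist"],
  "video": { "fps": 30, "resolution": [480, 640] },
  "state_keys": ["joint_positions", "gripper_width"]
}
```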

DreamGen Bench

Benchmark for evaluating quality of generated videos.

Evaluation Metrics

Metric                 Evaluation Model     Description
Instruction Following  Qwen2.5-VL / GPT-4o  Language-instruction compliance
Physics Alignment      Qwen-VL              Physical realism

Evaluation Targets

  • 4 video generation models
  • Various robot configurations

Integration with GR00T Series

GR00T N1

Attribute   Details
Usage       GR00T-Mimic (simulation synthetic data)
Limitation  Weak generalization: performs only pre-training tasks

GR00T N1.5

Attribute       Details
Usage           Full GR00T-Dreams integration
Effect          DreamGen Tasks success rate: 13.1% -> 38.3%
Training Time   36 hours (vs. ~3 months manual)
Generalization  22 new behaviors, 10 new environments

GR00T N1.6

Attribute  Details
Usage      Extended GR00T-Dreams application
VLM        Upgraded to Cosmos-Reason-2B
Effect     Enhanced reasoning and planning capabilities

Industry Adoption

Company           Application
1X                NEO Gamma humanoid training
Agility Robotics  Large-scale synthetic data generation
Skild AI          Synthetic dataset augmentation
AgiBot            Large-scale trajectory generation with GR00T-Mimic


See Also

  • GR00T Series
  • Cosmos - World Foundation Model Platform
  • Eagle - Vision-Language Model
  • Jim Fan - NVIDIA GEAR Lab, GR00T Research Lead