NVIDIA Cosmos (World Foundation Model)

NVIDIA's World Foundation Model Platform for Physical AI

Key Significance

  • World Model Platform for Physical AI: The first comprehensive world foundation model platform for robots, autonomous vehicles, and video analytics AI
  • Physics-Aware Video Generation: Trained on 9 trillion tokens from 20 million hours of real-world data to generate physically plausible videos
  • Open Model Ecosystem: Tokenizer, Predict, Transfer, and Reason models available under commercially permissive open license
  • Bridging Sim-to-Real Gap: Cosmos Transfer overcomes the visual domain gap between simulated and real environments
  • Integration with GR00T: GR00T N1.6 adopts the Cosmos-Reason-2B VLM for enhanced robot reasoning capabilities
  • Industry-Wide Adoption: Major companies including 1X, Figure AI, Agility Robotics, Waabi, XPENG, and Uber have adopted the platform

Overview

NVIDIA Cosmos is a World Foundation Model (WFM) platform designed to accelerate Physical AI development. Consisting of tokenizers, prediction models, transfer models, and reasoning models, it enables robots and autonomous vehicles to learn in digital environments first and then apply that knowledge to the real world.

| Item | Details |
| --- | --- |
| Initial Announcement | January 6, 2025 (CES 2025) |
| Major Update | March 18, 2025 (GTC 2025) |
| Company | NVIDIA |
| Paper | arXiv:2501.03575 |
| GitHub | nvidia-cosmos |
| License | NVIDIA Open Model License (commercially usable) |
| Training Data | 9 trillion tokens / 20 million hours of real-world data |

Cosmos Product Family

The Cosmos platform consists of four core components:

| Product | Role | Versions |
| --- | --- | --- |
| Cosmos Tokenizer | Compress images/videos to tokens | 0.1, 1.0 |
| Cosmos Predict | Predict future frames from text/image/video | 1.0, 2.0, 2.5 |
| Cosmos Transfer | Sim-to-real conversion, multi-control | 1, 2.5 |
| Cosmos Reason | Physical AI reasoning VLM | 1, 2 |

Model Size Categories

| Category | Description | Use Case |
| --- | --- | --- |
| Nano | Optimized for real-time, low-latency inference | Edge deployment |
| Super | Balance of performance and efficiency | General-purpose baseline |
| Ultra | Maximum quality and fidelity | Custom model distillation |

Cosmos Tokenizer

A neural network-based compression model that efficiently tokenizes images and videos.

Architecture

| Item | Details |
| --- | --- |
| Structure | Symmetric encoder-decoder |
| Temporal Design | Causal temporal convolution + attention |
| Preprocessing | 2-level Haar wavelet transform (4x downsampling) |
| Compression Rate | Spatial 8x/16x, temporal 4x/8x, total up to 2048x |
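
As a quick check on the figures above, the total compression rate is the product of the squared spatial factor and the temporal factor. A minimal sketch (the function name is illustrative, not a Cosmos API):

```python
def compression_factor(spatial: int, temporal: int) -> int:
    """Total pixel-to-token compression: the spatial factor applies to both
    height and width, the temporal factor to the frame axis."""
    return spatial * spatial * temporal

# The two video-tokenizer configurations named in the model tables:
print(compression_factor(8, 8))   # CV8x8x8  -> 512
print(compression_factor(16, 8))  # DV8x16x16 -> 2048
```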

Tokenizer Types

| Type | Code | Description | Used By |
| --- | --- | --- | --- |
| Continuous Image | CI | Continuous latent embeddings (image) | Diffusion models |
| Discrete Image | DI | Discrete tokens (image) | Autoregressive models |
| Continuous Video | CV | Continuous latent embeddings (video) | Diffusion models |
| Discrete Video | DV | Discrete tokens (video) | Autoregressive models |

Key Models

| Model | Compression | Use Case |
| --- | --- | --- |
| Cosmos-1.0-Tokenizer-CV8x8x8 | 8x8x8 = 512x | Diffusion WFM |
| Cosmos-1.0-Tokenizer-DV8x16x16 | 8x16x16 = 2048x | Autoregressive WFM |
| Cosmos-0.1-Tokenizer-CI8x8 | 8x8 = 64x | Image diffusion |
| Cosmos-0.1-Tokenizer-DI8x8 | 8x8 = 64x | Image AR |

Performance

| Metric | Value |
| --- | --- |
| Compression vs. SOTA | 8x higher |
| Speed vs. SOTA | Up to 12x faster |
| Max length at 1080p | 8 seconds (single A100 80GB) |
| Max length at 720p | 10 seconds (single A100 80GB) |
| Supported aspect ratios | 1:1, 3:4, 4:3, 9:16, 16:9 |

Cosmos Predict

World generation models that predict future frames from text, image, and video inputs.

Cosmos Predict 1.0

Diffusion Models

| Model | Parameters | Input | Output |
| --- | --- | --- | --- |
| Cosmos-1.0-Diffusion-7B-Text2World | 7B | Text | 121 frames |
| Cosmos-1.0-Diffusion-14B-Text2World | 14B | Text | 121 frames |
| Cosmos-1.0-Diffusion-7B-Video2World | 7B | Text + image/video | 120 frames |
| Cosmos-1.0-Diffusion-14B-Video2World | 14B | Text + image/video | 120 frames |

Architecture:

  • Diffusion Transformer (DiT) based
  • Interleaved Self-Attention + Cross-Attention + FFN structure
  • Adaptive Layer Normalization (AdaLN) for time information embedding
  • LoRA reduces parameters from 11B to 7B (36% reduction) while maintaining performance
  • Tokenizer: Cosmos-1.0-Tokenizer-CV8x8x8
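
To make the interleaved structure concrete, here is a minimal numpy sketch of one DiT block with AdaLN conditioning. Shapes, parameter names, and the single-head attention are illustrative simplifications, not the actual Cosmos implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, C = 8, 5, 3  # toy width, video tokens, text-context tokens

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def adaln(x, t_emb, W):
    # AdaLN: the diffusion-timestep embedding yields per-channel scale/shift
    scale, shift = np.split(t_emb @ W, 2, axis=-1)
    return layer_norm(x) * (1.0 + scale) + shift

def attention(q_in, kv_in, Wq, Wk, Wv):
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    logits = q @ k.T / np.sqrt(D)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def dit_block(x, text_ctx, t_emb, p):
    h = adaln(x, t_emb, p["ada1"])
    x = x + attention(h, h, p["q1"], p["k1"], p["v1"])         # self-attention
    h = adaln(x, t_emb, p["ada2"])
    x = x + attention(h, text_ctx, p["q2"], p["k2"], p["v2"])  # cross-attention (text)
    h = adaln(x, t_emb, p["ada3"])
    return x + np.maximum(h @ p["f1"], 0.0) @ p["f2"]          # FFN

shapes = {"ada1": (D, 2 * D), "ada2": (D, 2 * D), "ada3": (D, 2 * D),
          "q1": (D, D), "k1": (D, D), "v1": (D, D),
          "q2": (D, D), "k2": (D, D), "v2": (D, D),
          "f1": (D, 4 * D), "f2": (4 * D, D)}
p = {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}

out = dit_block(rng.normal(size=(T, D)),  # noisy video latent tokens
                rng.normal(size=(C, D)),  # text-encoder embeddings
                rng.normal(size=(D,)),    # timestep embedding
                p)
print(out.shape)  # (5, 8)
```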

Autoregressive Models

| Model | Parameters | Input | Output |
| --- | --- | --- | --- |
| Cosmos-1.0-Autoregressive-4B | 4B | Image (first frame) | 32 frames |
| Cosmos-1.0-Autoregressive-12B | 12B | Image (first frame) | 32 frames |
| Cosmos-1.0-Autoregressive-5B-Video2World | 5B | Text + image/video | 24-32 frames |
| Cosmos-1.0-Autoregressive-13B-Video2World | 13B | Text + image/video | 24-32 frames |

Architecture:

  • Llama3-style GPT structure (trained from scratch)
  • Interleaved Self-Attention + FFN structure
  • Video2World: Cross-Attention added via T5 embeddings
  • Tokenizer: Cosmos-1.0-Tokenizer-DV8x16x16
  • Resolution: 1024x640
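
A back-of-envelope token budget follows from the tokenizer and resolution above (this sketch ignores the tokenizer's causal first-frame special case):

```python
# DV8x16x16 at 1024x640: temporal 8x, spatial 16x along each axis.
H, W, FRAMES = 640, 1024, 32
SPATIAL, TEMPORAL = 16, 8

tokens_per_latent_frame = (H // SPATIAL) * (W // SPATIAL)  # 40 * 64
latent_frames = FRAMES // TEMPORAL
total_tokens = tokens_per_latent_frame * latent_frames
print(tokens_per_latent_frame, latent_frames, total_tokens)  # 2560 4 10240
```

So a 32-frame clip costs on the order of ten thousand discrete tokens for the GPT-style model to generate.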

Cosmos Predict 2.5 (Oct 2025)

The latest world simulation model that unifies Text2World/Image2World/Video2World into a single model.

| Model | Parameters | Features |
| --- | --- | --- |
| Cosmos-Predict2.5-2B | 2B | Optimized for edge deployment |
| Cosmos-Predict2.5-14B | 14B | Highest quality |

Key Improvements:

  • Flow-matching architecture adopted
  • Uses Cosmos-Reason1 VLM as text encoder
  • Trained on 200 million curated video clips
  • Supports robot action sequence conditioned prediction
  • 7-camera multiview support (for autonomous driving)
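
As background on the flow-matching objective mentioned above, here is a minimal numpy sketch in the rectified-flow style (straight paths between noise and data); all names are illustrative, and the zero-velocity "model" is a placeholder:

```python
import numpy as np

def flow_matching_loss(model, data, rng):
    """Regress a velocity field onto the straight path noise -> data:
    x_t = (1 - t) * noise + t * data, target velocity = data - noise."""
    noise = rng.normal(size=data.shape)
    t = rng.uniform(size=(data.shape[0], 1))
    x_t = (1.0 - t) * noise + t * data
    target_v = data - noise
    return float(np.mean((model(x_t, t) - target_v) ** 2))

rng = np.random.default_rng(0)
data = rng.normal(size=(16, 4))
# Trivial model that always predicts zero velocity, for illustration only.
loss = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), data, rng)
print(loss > 0.0)  # True
```

At sampling time, the learned velocity field is integrated from noise toward data in a fixed number of steps, which is what makes flow matching attractive for fast world generation.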

Cosmos Transfer

Models that transform simulated environments to photorealistic levels and control video generation through structured inputs (segmentation, depth, edges, etc.).

Cosmos Transfer 1 (Mar 2025)

| Item | Details |
| --- | --- |
| Paper | arXiv:2503.14492 |
| Base Model | Cosmos-Predict1 |
| Architecture | DiT + ControlNet |
| Control Blocks | 3 transformer blocks |
| Initialization | Zero-initialized linear layer |

Supported Input Modalities:

  • Segmentation video
  • Depth video
  • Edge video
  • Blur video
  • LiDAR video
  • HDMap video (for autonomous driving)

Key Features:

  • Spatiotemporal Control Map: Adjusts spatiotemporal weights for each modality
  • MultiControlNet: Enables simultaneous use of multiple modalities
  • Sim-to-Real Transformation: Converts simulation footage to photorealistic quality
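
A minimal sketch of how a spatiotemporal control map might weight several control branches before injection into the backbone; shapes and names are illustrative assumptions, not the actual Transfer implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, W, C = 2, 4, 4, 3  # toy frames, height, width, channels

# Per-modality control residuals (stand-ins for ControlNet branch outputs).
residuals = {"depth": rng.normal(size=(T, H, W, C)),
             "edge":  rng.normal(size=(T, H, W, C))}

# Spatiotemporal control map: one weight per (frame, h, w) per modality,
# normalized so the weights sum to 1 at every location.
raw = {m: rng.uniform(size=(T, H, W, 1)) for m in residuals}
total = sum(raw.values())
weights = {m: raw[m] / total for m in raw}

blended = sum(weights[m] * residuals[m] for m in residuals)
backbone = rng.normal(size=(T, H, W, C))
out = backbone + blended  # weighted control signal added to base features
print(out.shape)  # (2, 4, 4, 3)
```

Varying the weight maps over space and time lets one modality (e.g. depth) dominate in some regions of a clip while another (e.g. edges) dominates elsewhere.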

Cosmos Transfer 2.5 (Oct 2025)

Next-generation transfer model based on Cosmos-Predict2.5.

| Model | Capability |
| --- | --- |
| Cosmos-Transfer2.5 | World simulation driven by multiple spatial control inputs |

Cosmos Reason

Reasoning Vision-Language Model (VLM) for Physical AI. Enables robots and AI agents to reason like humans to understand and act in the physical world.

Cosmos Reason 2 (Dec 2025)

| Item | Details |
| --- | --- |
| Announcement | December 19, 2025 (CoRL 2025) |
| CES 2026 Release | January 2026 |
| Base Architecture | Qwen3-VL |
| Structure | Vision Transformer (ViT) + dense Transformer LLM |
| Context Length | Up to 256K tokens |

Model Versions

| Model | Parameters | Use Case |
| --- | --- | --- |
| Cosmos-Reason2-2B | 2B | Edge/embedded (used in GR00T N1.6) |
| Cosmos-Reason2-8B | 8B | Cloud/high-performance inference |

Key Capabilities

| Capability | Description |
| --- | --- |
| Physical Common Sense | Understanding of space, time, and fundamental physics |
| Chain-of-Thought Reasoning | Generates embodied decisions through long reasoning chains |
| Spatiotemporal Precision | Accurate event tracking based on timestamps |
| Object Detection | 2D/3D point localization and bounding boxes with reasoning explanations |
| Causal Analysis | Reasoning about "Why is this happening?" and "What will happen next?" |

Use Cases

| Domain | Application |
| --- | --- |
| Robot Planning | System 2 (slow thinking) role in VLA models |
| Video Analytics | Large-scale video insight extraction in urban/industrial environments |
| Data Annotation | Automated labeling and description of synthetic/real videos |

Integration with GR00T

Cosmos is tightly integrated with NVIDIA’s GR00T humanoid robot foundation model.

Cosmos-Reason-2B in GR00T N1.6

| Item | Details |
| --- | --- |
| VLM | Cosmos-Reason-2B (upgraded from Eagle2-1B) |
| Feature | Native-resolution support (distortion-free input) |
| Effect | Improved scene understanding and task decomposition |

Improvement Effects:

  • 2x larger VLM compared to Eagle2-1B for enhanced visual understanding
  • Native resolution support processes images without padding
  • Better environmental reasoning and situational awareness

Cosmos + GR00T Training Pipeline

Omniverse (Simulation)
    |
Cosmos Predict (Synthetic Data Generation)
    |
Cosmos Transfer (Sim-to-Real Transformation)
    |
Cosmos Reason (Data Labeling/Annotation)
    |
GR00T N1.6 (VLA Training)
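
The pipeline above can be read as plain function composition over clips; every function below is a hypothetical stand-in for illustration, not a real Cosmos API:

```python
# Hypothetical stage stand-ins: each passes an annotated clip downstream.
def omniverse_render(scene):    return {"scene": scene}
def cosmos_predict(seed):       return {**seed, "synthetic_video": True}
def cosmos_transfer(clip):      return {**clip, "photoreal": True}
def cosmos_reason_label(clip):  return {**clip, "caption": f"robot in {clip['scene']}"}

scenes = ["warehouse", "kitchen"]
dataset = [cosmos_reason_label(cosmos_transfer(cosmos_predict(omniverse_render(s))))
           for s in scenes]
# The resulting labeled, photoreal clips feed GR00T N1.6 VLA training.
print(len(dataset), dataset[0]["caption"])  # 2 robot in warehouse
```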

Physical AI Applications

Robotics

| Company | Application |
| --- | --- |
| 1X | Training NEO Gamma with Cosmos Predict + Transfer |
| Agility Robotics | Large-scale synthetic data generation with Cosmos Transfer + Omniverse |
| Figure AI | Physical AI data pipeline |
| Skild AI | Augmenting synthetic datasets with Cosmos Transfer |

Autonomous Driving

| Company | Application |
| --- | --- |
| Waabi | Autonomous driving scenario generation |
| XPENG | Vehicle AI training data |
| Uber | Ridesharing autonomous driving research |

Timeline

| Date | Event |
| --- | --- |
| Jan 6, 2025 | Cosmos platform announced at CES 2025 |
| Jan 7, 2025 | arXiv paper published (2501.03575) |
| Mar 18, 2025 | Major updates announced at GTC 2025 |
| Mar 2025 | Cosmos-Transfer1 paper released (2503.14492) |
| Jun 2025 | Cosmos-Reason-2B integrated into GR00T N1.6 |
| Oct 6, 2025 | Cosmos-Predict2.5 and Cosmos-Transfer2.5 released |
| Dec 19, 2025 | Cosmos-Reason2 released (CoRL 2025) |
| Jan 2026 | Cosmos Reason 2 officially unveiled at CES 2026 |


See Also

  • Jim Fan - NVIDIA GEAR Lab, Physical AI Research Lead