Eagle (NVIDIA Eagle VLM)

NVIDIA's frontier vision-language model series based on data-centric strategies

Author’s Opinion

  • Used as the VLM for the GR00T series. Different tasks demand different visual understanding capabilities, which Eagle addresses by running multiple vision encoders in parallel.
  • It is interesting that, without any complex fusion module, simple concatenation lets the LLM selectively use the information it needs.

Key Significance

  • Mixture of Encoders: Uses multiple Vision Encoders in parallel for complementary visual understanding
  • Data-Centric Approach: Focuses on post-training data strategies over architecture
  • Visual Brain for GR00T Series: Adopted as VLM in N1 (Eagle2-1B) and N1.5 (Eagle 2.5)
  • Efficient Performance: Eagle2-9B matches 70B-class models

Overview

Eagle is NVIDIA’s frontier vision-language model (VLM) series. Based on a “Data-Centric” philosophy, it combines post-training data strategies, vision-centric model design, and scalable training techniques to achieve frontier-level performance with competitive parameter efficiency.

| Item | Details |
| --- | --- |
| Developer | NVIDIA |
| Initial Release | August 2024 (Eagle v1) |
| License | Code: Apache 2.0, Model: CC-BY-NC-4.0 |
| GitHub | NVlabs/EAGLE |
| Demo | HuggingFace Demo |

Versions

Eagle (v1) - August 2024

An exploration of the Mixture of Encoders design space for multimodal LLMs; selected as an ICLR 2025 Spotlight.

| Item | Details |
| --- | --- |
| Release | August 28, 2024 |
| Paper | arXiv:2408.15998 |
| Key Contribution | Mixture of Encoders design, Pre-Alignment technique |
| Award | ICLR 2025 Spotlight |

Key Findings:

  • Simple concatenation of visual tokens from complementary vision encoders is as effective as complex mixing architectures
  • Introduction of Pre-Alignment to enhance coherence between vision encoders and language tokens
  • Enhanced visual perception contributes to reduced hallucination and improved OCR performance

Eagle 2 Series - January 2025

A methodology for building post-training data strategies from scratch.

| Item | Details |
| --- | --- |
| Release | January 20, 2025 |
| Paper | arXiv:2501.14818 |
| Key Contribution | Data-centric strategy, Tiled Mixture of Encoders |

Eagle2 Model Lineup

| Model | LLM Backbone | Vision Encoder | Parameters | Context |
| --- | --- | --- | --- | --- |
| Eagle2-1B | Qwen2.5-0.5B-Instruct | SigLIP | 1B | 16K |
| Eagle2-2B | Qwen2.5-1.5B-Instruct | SigLIP | 2B | 16K |
| Eagle2-9B | Qwen2.5-7B-Instruct | SigLIP + ConvNeXt | 8.9B | 16K |

Eagle2 Benchmark Performance

Eagle2-9B vs Competing Models:

| Benchmark | Eagle2-9B | Qwen2-VL-7B | GPT-4V |
| --- | --- | --- | --- |
| DocVQA | 92.6% | 94.5% | 88.4% |
| ChartQA | 86.4% | 83.0% | - |
| OCRBench | 868 | 845 | - |
| MMMU | 56.1% | 54.1% | - |
| MathVista | 63.8% | 58.2% | - |
| MMStar | 62.6% | 60.7% | - |

Eagle2-1B Benchmark:

| Benchmark | Eagle2-1B |
| --- | --- |
| DocVQA | 81.8% |
| ChartQA | 77.0% |
| TextVQA | 76.6% |
| OCRBench | 767 |
| AI2D | 70.9% |

Eagle2-2B vs InternVL2-2B:

| Benchmark | Eagle2-2B | InternVL2-2B |
| --- | --- | --- |
| TextVQA | 79.1% | 73.4% |
| OCRBench | 818 | 784 |
| MME | 2109.8 | 1876.8 |
| MMStar | 56.4% | 50.1% |

Eagle 2.5 - April 2025

Frontier VLM for Long-Context Multimodal Learning.

| Item | Details |
| --- | --- |
| Release | April 21, 2025 |
| Paper | arXiv:2504.15271 |
| Key Contribution | 512-frame video support, Eagle-Video-110K dataset |

Eagle 2.5 Model

| Model | LLM Backbone | Vision Encoder | Parameters | Video Frames |
| --- | --- | --- | --- | --- |
| Eagle2.5-8B | Qwen2.5-7B-Instruct | SigLIP2-So400m | 8B | Up to 512 |

Key Technical Innovations

| Technology | Description |
| --- | --- |
| Image Area Preservation (IAP) | Tiling optimization that maximizes preservation of the original image area and aspect ratio |
| Automatic Degrade Sampling (ADS) | Dynamic balancing of visual/text input while ensuring text completeness |
| Progressive Training | Gradual context-length expansion from 32K to 128K |
| Eagle-Video-110K | 110K-video dataset with story- and clip-level annotations |
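To make the IAP idea concrete, here is a rough, hypothetical sketch of aspect-ratio-aware tile selection. The tile budget, scoring rule, and tie-breaking below are our own illustrative choices, not the paper's exact algorithm:

```python
# Hypothetical sketch of the intuition behind Image Area Preservation
# (IAP): choose the tile grid whose aspect ratio best matches the input
# image, so resizing into the grid discards as little of the original
# area as possible. Budget, scoring, and tie-breaking are our own choices.

def best_grid(width, height, max_tiles=12):
    image_ratio = width / height
    best, best_score = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            if rows * cols > max_tiles:
                continue
            # Smaller aspect-ratio mismatch means less crop/pad distortion.
            score = abs(cols / rows - image_ratio)
            # Tie-break toward more tiles, i.e. higher preserved resolution.
            more_tiles = rows * cols > best[0] * best[1]
            if score < best_score or (score == best_score and more_tiles):
                best, best_score = (rows, cols), score
    return best

# A wide 1600x800 image gets a wide grid (more columns than rows).
print(best_grid(1600, 800))  # (2, 4)
```

The takeaway is the objective, not the search: instead of forcing every image into a fixed square grid, the tiling adapts to the image so less of the original content is cropped or distorted.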

Eagle 2.5 Benchmarks

Video Understanding Benchmarks:

| Benchmark | Eagle2.5-8B | GPT-4o | Qwen2.5-VL-72B |
| --- | --- | --- | --- |
| Video-MME (w/o sub) | 72.4% | 71.9% | - |
| Video-MME (w/ sub) | 75.7% | 77.2% | - |
| MLVU | 77.6% | - | - |
| LVBench | 66.4% | 66.7% | - |
| EgoSchema | 72.2% | - | 72.2% |

Image Benchmarks:

| Benchmark | Eagle2.5-8B | GPT-4o |
| --- | --- | --- |
| DocVQA | 94.1% | 92.8% |
| OCRBench | 869 | 736 |
| TextVQA | 83.7% | 77.4% |

Architecture

Mixture of Encoders

Eagle’s core architectural idea is running multiple vision encoders in parallel.

Eagle Training Pipeline

Key finding: “Simply concatenating visual tokens from complementary vision encoders is as effective as complex mixing architectures.”

Parallel Encoding Structure (Eagle2-9B)

SigLIP and ConvNeXt operate in parallel, not in series. The same image is fed to both encoders independently and processed separately before being combined via Channel Concatenation.

| Encoder | Role | Strengths |
| --- | --- | --- |
| SigLIP | Global semantic understanding | Vision-language alignment, general visual perception |
| ConvNeXt | Local detailed features | OCR, chart/document understanding, high-resolution details |

Both perspectives independently interpret the image, and the LLM selectively utilizes needed information from the concatenated features.
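As a concrete illustration, the parallel-encoder fusion can be sketched in a few lines of Python. This is our own toy mock-up, not NVIDIA's code: the two encoder functions are stand-ins, and only the channel-concatenation step mirrors what the papers describe.

```python
# Toy mock-up of Eagle-style "Mixture of Encoders" fusion (illustrative
# only): two stand-in encoders view the same image in parallel, and
# their per-token features are joined by channel concatenation.

def fake_siglip(image, num_tokens=4, dim=6):
    # Stand-in for SigLIP: global semantic features, one vector per token.
    return [[0.1 * (t + d) for d in range(dim)] for t in range(num_tokens)]

def fake_convnext(image, num_tokens=4, dim=3):
    # Stand-in for ConvNeXt: local detail features, one vector per token.
    return [[0.2 * (t - d) for d in range(dim)] for t in range(num_tokens)]

def channel_concat(tokens_a, tokens_b):
    # Token counts must match; feature dimensions add up.
    assert len(tokens_a) == len(tokens_b)
    return [a + b for a, b in zip(tokens_a, tokens_b)]

image = "dummy image"
fused = channel_concat(fake_siglip(image), fake_convnext(image))
# 4 tokens, each with 6 + 3 = 9 channels; in the real model an MLP
# projector then maps each fused token into the LLM's embedding space.
print(len(fused), len(fused[0]))  # 4 9
```

The point the sketch makes is the paper's own finding: no gating or cross-attention is needed at fusion time; the concatenated channels simply sit side by side, and the LLM learns which to use.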

Components

| Component | Role |
| --- | --- |
| SigLIP | Vision-language-aligned vision encoder; global semantic understanding |
| ConvNeXt-XXLarge | CNN trained on LAION-2B; local feature extraction (Eagle2-9B only) |
| MLP Projector | Aligns vision embeddings with the LLM representation space |
| Qwen2.5 | LLM backbone (0.5B/1.5B/7B depending on version) |

Integration with GR00T

Eagle VLM serves as the System 2 (The Thinker) in NVIDIA’s GR00T robot foundation model.

GR00T N1 (March 2025)

| Item | Details |
| --- | --- |
| VLM | Eagle2-1B (trainable) |
| Total Parameters | 2.2B (VLM: 1.34B) |
| Role | Environment perception and language-instruction understanding |
| Training | VLM fine-tuned together with the entire model |

GR00T N1.5 (May 2025)

| Item | Details |
| --- | --- |
| VLM | Eagle 2.5 (frozen) |
| Key Change | Freezing the VLM to preserve language understanding |
| Result | Language-following rate 46.6% → 93.3% |
| Grounding | 40.4 IoU (Qwen2.5-VL: 35.5) |

Role in System 1, 2 Structure

GR00T N1 Architecture: Eagle VLM serves as System 2

  • System 2 (Eagle): Environment perception, language instruction understanding, action planning
  • System 1 (Diffusion): Continuous action generation, real-time control
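The division of labor above can be sketched as a two-rate control loop. Everything here (function names, the 3:1 rate ratio, string placeholders) is a hypothetical illustration of the System 2 / System 1 split, not GR00T's actual interface:

```python
# Hypothetical two-rate loop illustrating the System 2 / System 1 split.
# Names, rates, and string outputs are our own stand-ins, not GR00T's API.

def system2_plan(observation, instruction):
    # Stand-in for Eagle VLM ("the thinker"): slow, produces a plan.
    return f"plan({instruction} @ {observation})"

def system1_act(plan, tick):
    # Stand-in for the diffusion action head: fast, one action per tick.
    return f"action[{tick}] <- {plan}"

actions, plan = [], None
for tick in range(6):
    if tick % 3 == 0:          # System 2 replans at a third of the rate
        plan = system2_plan(f"obs{tick}", "pick up the cup")
    actions.append(system1_act(plan, tick))   # System 1 acts every tick

print(len(actions))  # 6
```

In the real N1/N1.5 stack the plan is a latent embedding and System 1 runs a diffusion policy, but the structure is the same: a slow perceiver feeding a fast controller.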

Version Comparison Summary

| Item | Eagle (v1) | Eagle 2 | Eagle 2.5 |
| --- | --- | --- | --- |
| Release | 2024.08 | 2025.01 | 2025.04 |
| Paper | arXiv:2408.15998 | arXiv:2501.14818 | arXiv:2504.15271 |
| LLM | - | Qwen2.5 (0.5B–7B) | Qwen2.5-7B |
| Vision | Mixture of Encoders exploration | SigLIP (+ConvNeXt) | SigLIP2 |
| Key Contribution | Architecture design | Data strategy | Long context |
| Video | - | 64 frames | 512 frames |
| GR00T Integration | - | N1 (Eagle2-1B) | N1.5 |

Training Infrastructure

| Item | Eagle2-9B |
| --- | --- |
| GPU | 256× H100 |
| Training Time | Tens of hours |
| Precision | BF16 |


See Also

  • SigLIP - Google’s Vision-Language encoder
  • Qwen2.5 - Alibaba’s LLM series
  • ConvNeXt - Meta’s modernized CNN architecture
  • OpenCLIP - Framework used to train ConvNeXt-XXLarge
  • LAION-2B - Large-scale image-text dataset used for ConvNeXt-XXLarge training