Diffusion Policy

Columbia/MIT's Diffusion-based Visuomotor Policy Learning

Key Significance

  • Pioneering Diffusion Application to Robotics: First successful application of diffusion models, already proven in image generation, to robot action generation
  • Natural Multimodal Action Handling: Learns multiple modes when several valid actions exist in the same situation, and commits to a single mode at execution
  • Highly Stable Training: Converges far more stably than existing imitation learning methods
  • 46.9% Average Performance Improvement Across 4 Benchmarks: Validated on 12 tasks drawn from Robomimic, IBC, Behavior Transformer, and Relay Policy Learning
  • Significant Influence on Follow-up Research: Directly influenced action generation methods of subsequent VLAs like pi0’s flow matching and Octo’s diffusion decoder
  • LeRobot Default Support: One of the default supported models in HuggingFace LeRobot alongside ACT
  • Robustness: Robust performance against occlusion, perturbation, and visual distractions

Figure: Diffusion Policy progressively generates action sequences from noise


Overview

Diffusion Policy is a new approach that represents robot visuomotor policies as conditional denoising diffusion processes. It elegantly handles multimodal action distributions, is well-suited for high-dimensional action spaces, and demonstrates excellent training stability.

| Item | Details |
| --- | --- |
| Published | March 2023 (RSS 2023) |
| Authors | Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Shuran Song, et al. |
| Affiliation | Columbia University, MIT, Toyota Research Institute |
| Paper | arXiv:2303.04137 |
| Journal | IJRR 2024 (extended version) |
| Project | diffusion-policy.cs.columbia.edu |

Key Ideas

Diffusion for Action Generation

Unlike traditional policies that directly predict actions, Diffusion Policy progressively generates actions starting from noise.

Pure Noise → ... → Intermediate Noise → ... → Final Action Sequence
          ← Denoising Steps (Langevin dynamics) ←

Core Principle:

  • Learns the score function (the gradient of the log-density) of the action distribution
  • Refines samples iteratively at inference via stochastic Langevin dynamics
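
In practice the model is trained as a noise predictor: corrupt a clean action sequence with scheduled Gaussian noise, then regress the noise that was added. Below is a minimal PyTorch sketch of this DDPM-style objective; `noise_pred_net` is a placeholder for the conditional denoising network, and the linear beta schedule is illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def linear_alphas_cumprod(num_timesteps=100, beta_start=1e-4, beta_end=0.02):
    # Cumulative product of (1 - beta_t) for a linear beta schedule.
    betas = torch.linspace(beta_start, beta_end, num_timesteps)
    return torch.cumprod(1.0 - betas, dim=0)

def diffusion_policy_loss(noise_pred_net, actions, obs_cond, alphas_cumprod):
    """One DDPM training step on an action sequence.

    actions:  (B, T, action_dim) clean demonstration actions
    obs_cond: (B, cond_dim) observation features used as conditioning
    """
    B = actions.shape[0]
    # Pick a random diffusion timestep for each trajectory in the batch.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=actions.device)
    # Forward process: corrupt clean actions with scheduled Gaussian noise.
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod.to(actions.device)[t].view(B, 1, 1)
    noisy_actions = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * noise
    # The network predicts the injected noise, conditioned on observations.
    noise_pred = noise_pred_net(noisy_actions, t, obs_cond)
    return F.mse_loss(noise_pred, noise)
```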

Multimodal Action Handling

When multiple valid actions exist in the same situation (e.g., pushing an object from the left or the right), Diffusion Policy:

  • Learns multi-mode behaviors
  • Commits to one mode per rollout
  • Outperforms existing methods like LSTM-GMM and IBC

Receding Horizon Control

Predicts a sequence of future actions rather than a single action, executes only the first portion, then re-plans from the new observation; this keeps behavior temporally consistent while staying reactive.
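
A minimal sketch of this receding-horizon loop; `policy` and `env` are illustrative placeholders rather than a specific library's API.

```python
# Receding-horizon execution: predict a sequence, execute only the first
# chunk, then re-plan from the fresh observation.
PRED_HORIZON = 16  # length of each predicted action sequence
ACT_HORIZON = 8    # actions actually executed before re-planning

obs = env.reset()
done = False
while not done:
    action_seq = policy.predict_action_sequence(obs)  # (PRED_HORIZON, action_dim)
    for action in action_seq[:ACT_HORIZON]:
        obs, done = env.step(action)
        if done:
            break
```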


Architecture

Time-Series Diffusion Transformer

| Component | Description |
| --- | --- |
| Visual Encoder | Encodes image observations into conditioning features |
| Diffusion Backbone | Transformer- or CNN-based noise-prediction network |
| Noise Scheduler | DDPM-based noise scheduling |

Inputs:

  • Visual observations (images)
  • Current robot state

Outputs:

  • Future action sequence
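
Putting the pieces together, inference starts from pure Gaussian noise over the whole action sequence and denoises it conditioned on the observations. The sketch below uses the DDPMScheduler from HuggingFace diffusers (which the official implementation also builds on); `vision_encoder` and `noise_pred_net` stand in for the trained modules.

```python
import torch
from diffusers import DDPMScheduler

@torch.no_grad()
def sample_actions(noise_pred_net, vision_encoder, images, robot_state,
                   pred_horizon=16, action_dim=7, num_steps=100):
    """Conditionally denoise an action sequence from pure Gaussian noise."""
    scheduler = DDPMScheduler(num_train_timesteps=num_steps,
                              beta_schedule="squaredcos_cap_v2")
    scheduler.set_timesteps(num_steps)
    # Conditioning vector: visual features concatenated with robot state.
    obs_cond = torch.cat([vision_encoder(images), robot_state], dim=-1)
    # Start from pure noise over the whole future action sequence.
    actions = torch.randn(1, pred_horizon, action_dim)
    # Iteratively denoise; the observations condition every step.
    for t in scheduler.timesteps:
        noise_pred = noise_pred_net(actions, t, obs_cond)
        actions = scheduler.step(noise_pred, t, actions).prev_sample
    return actions  # (1, pred_horizon, action_dim) future action sequence
```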

Performance

Benchmark Results

46.9% average performance improvement across 4 benchmarks and 12 tasks

| Benchmark | Tasks |
| --- | --- |
| Robomimic | Lift, Can, Square, Tool Hang, Transport |
| IBC | Push-T, Block Pushing |
| Behavior Transformer | Franka Kitchen |
| Relay Policy Learning | Franka Kitchen |

Real Robot Validation

| Task | Description |
| --- | --- |
| Push-T | T-shaped object pushing manipulation |
| Mug Flipping | Flipping a mug |
| Sauce Preparation | Sauce preparation with 6-DoF control |

Advantages

| Feature | Description |
| --- | --- |
| Multimodal | Handles action distributions with multiple valid modes |
| High-dimensional | Well-suited to high-dimensional action spaces |
| Stability | Stable training convergence |
| Robustness | Robust to occlusion, perturbation, and visual distractions |

Comparison with ACT

| Item | Diffusion Policy | ACT |
| --- | --- | --- |
| Generation Method | Denoising diffusion | CVAE decoder |
| Multimodality | Handled naturally | Via style variable (z) |
| Inference Speed | Multiple denoising steps required | Single forward pass |
| Training Stability | Very high | High |

Impact

Diffusion Policy is a pioneering work applying diffusion models to robot learning, influencing many subsequent studies:

  • Flow matching-based approach in pi0 (Physical Intelligence)
  • Default supported model in LeRobot (see the usage sketch below)
  • Standard baseline in various manipulation tasks
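
For example, the pretrained Push-T checkpoint can be loaded through LeRobot. A hedged sketch: the `lerobot/diffusion_pusht` checkpoint name, the import path, and the observation keys below follow one LeRobot release and may differ across versions.

```python
import torch
from lerobot.common.policies.diffusion.modeling_diffusion import DiffusionPolicy

# Load the pretrained Push-T diffusion policy from the HuggingFace Hub.
policy = DiffusionPolicy.from_pretrained("lerobot/diffusion_pusht")
policy.eval()

# LeRobot policies take a batch dict of observations and return one action;
# the key names and shapes below follow the Push-T dataset.
batch = {
    "observation.image": torch.zeros(1, 3, 96, 96),  # dummy camera frame
    "observation.state": torch.zeros(1, 2),          # dummy 2-D agent position
}
with torch.no_grad():
    action = policy.select_action(batch)
```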
