Teleoperation Approach

Action data collection through remote operation

Overview

Teleoperation is a method where a human remotely controls a robot while simultaneously collecting motion data. Modern teleoperation traces back to 1954, when Raymond Goertz invented a mechanical Master-Slave manipulator for safely handling radioactive materials.

While past teleoperation was simply a remote control mechanism that executed human commands at a distance, today’s teleoperation has transformed into a data collection pipeline for Foundation Model training. Every moment an operator controls a robot — the visual observations, proprioception, and motion trajectories — becomes AI training data.

ROBOTIS OMY

The most intuitive example of Leader-Follower teleoperation — move the small Leader robot, and the large Follower robot mirrors it.


Why Teleoperation Matters

As covered in Action Data Scaling Problem, VLA (Vision-Language-Action) models cannot source training data from the internet like LLMs can. Robot motion data can only be generated by physically moving robots.

Teleoperation is the dominant data collection approach today for clear reasons:

  • High-quality (observation, action) pairs: Robot sensor readings and operator commands are recorded in perfect synchronization.
  • Precise action space definition: Joint positions, velocities, and torques are directly recorded in a learning-ready format.
  • Simultaneous language labels: Google’s RT-1 recorded text instructions alongside teleoperation demonstrations to collect 130K episodes. This is critical for multi-task generalization.
  • Failure/correction data inclusion: Recording not just successes but also failures and recovery trajectories enables robots to learn robust policies for out-of-distribution situations.
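The synchronized (observation, action) recording described above can be sketched as a minimal episode logger. The `Step`/`Episode` names and field choices are illustrative, not from any specific system; note that failed episodes are kept rather than discarded:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Step:
    t: float           # timestamp (seconds)
    observation: dict  # e.g. camera frames, joint positions (proprioception)
    action: list       # operator command, e.g. target joint angles

@dataclass
class Episode:
    instruction: str                       # language label recorded with the demo
    steps: list = field(default_factory=list)
    success: bool = False                  # failures are kept, not filtered out

def record_step(episode, observation, action):
    """Append one synchronized (observation, action) pair with its timestamp."""
    episode.steps.append(Step(time.monotonic(), observation, action))

ep = Episode(instruction="pick up the red cup")
record_step(ep, {"joint_pos": [0.00] * 7}, [0.01] * 7)
record_step(ep, {"joint_pos": [0.01] * 7}, [0.02] * 7)
```

The language instruction lives on the episode, mirroring how RT-1-style datasets pair text with demonstrations.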

Most major AI robotics companies — Tesla, Google DeepMind, Physical Intelligence, 1X Technologies, Galaxea — collect their core training data through teleoperation.

Data Pyramid

Robot learning data pyramid. Real-World Data at the top is the most expensive but highest-quality data, and teleoperation is the primary method for collecting it. Lower tiers offer more volume but are less directly usable for robot behavior learning. (Source: NVIDIA GR00T N1 Paper)


Modality Classification

Teleoperation modalities vary by input device, control scheme, and feedback level.

| Modality | Representative System | Cost Range | Key Feature |
| --- | --- | --- | --- |
| Leader-Follower | ALOHA, GELLO, ROBOTIS OMY | $300–$20K | No IK needed, physically feel joint limits |
| VR/Motion Capture | Open-TeleVision, Bunny-VisionPro | $500–$10K | Immersive 3D control, remote operation |
| Exoskeleton | HOMIE, CHILD, HumanoidExo | $300–$5K | Isomorphic design eliminates retargeting, full-body control |
| Glove-based | DOGlove, SenseGlove R1, HaptX G1 | $600–$18K | Hand tracking + haptic feedback, precision manipulation |
| In-Simulation Teleop | NVIDIA GR00T-Teleop, RoboCasa | Software cost only | 24/7 collection, synthetic data amplification |


Leader-Follower (Kinematic Replica)

A Leader-Follower setup consists of a small Leader arm that the operator physically moves and a larger Follower robot that mirrors it in real time. Since Leader and Follower share identical kinematics, joint angles map 1:1 without inverse kinematics, and the operator can physically feel joint limits and singularities through the Leader.
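That 1:1 joint mapping is essentially the entire control law. A minimal sketch of one control tick, with hypothetical function and parameter names:

```python
def mirror(leader_joints, scale=1.0, offsets=None):
    """Map leader joint angles onto the follower.

    Because leader and follower share the same kinematics, no inverse
    kinematics is needed: each joint copies directly, optionally with a
    per-joint scale and offset for calibration."""
    offsets = offsets or [0.0] * len(leader_joints)
    return [scale * q + o for q, o in zip(leader_joints, offsets)]

# One control tick: read the leader pose, command the follower, log the pair.
leader_q = [0.10, -0.25, 0.40, 1.20, 0.00, 0.55]  # radians (example values)
follower_cmd = mirror(leader_q)
assert follower_cmd == leader_q  # identity mapping when scale=1, offsets=0
```

In a real system this loop runs at a few hundred hertz, and every tick's (observation, `follower_cmd`) pair is logged as training data.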

Stanford’s ALOHA ($20K) and UC Berkeley’s GELLO (under $300, 3D-printed) are representative examples.

GELLO — A universal Leader-Follower teleoperation interface that can be 3D-printed for under $300. (Source: GELLO Project Page)

VR / Motion Capture

Operators wear VR headsets (Meta Quest, Apple Vision Pro) or motion capture suits to control robots in 3D space. Arm and wrist movements map to robot trajectories, with the major advantage of enabling remote operation.

Open-TeleVision (UCSD/MIT) uses Apple Vision Pro to stream stereo video while controlling robots at 60Hz. It demonstrated cross-continent teleoperation with an MIT operator controlling a Unitree H1 at UC San Diego.

Bunny-VisionPro is a bimanual dexterous teleoperation system using Apple Vision Pro. It provides tactile feedback through low-cost haptic devices (ERM vibration motors) and includes built-in collision avoidance and singularity checks for safe control. Demonstration data collected through the system has been used to train ACT and Diffusion Policy models, achieving strong spatial generalization on multi-stage dexterous tasks like skincare and kitchen operations. (arXiv:2407.03162)

Exoskeleton

Wearable exoskeleton devices that control robots through the operator’s body movements. When designed with kinematics identical to the robot (“isomorphic”), joint angles transfer directly without retargeting computation, and operators receive passive proprioceptive feedback.

Full-body exoskeletons

  • HOMIE (Shanghai AI Lab, 2025): ~$500 cockpit system. Isomorphic exoskeleton arms + sensing gloves + foot pedals for whole-body teleop. Validated on Unitree G1, Fourier GR-1. Halves task time vs. prior systems, 70%+ imitation learning success. Open-source.
  • CHILD (UIUC, Humanoids 2025): Baby-carrier-sized compact full-body teleop system. 14ms latency. Validated on Unitree G1, Boston Dynamics Spot.
  • HumanoidExo (2025): Wearable exoskeleton + back-mounted LiDAR for 6D pose tracking. Just 5 real demos + 195 exoskeleton sessions achieve 80% pick-and-place success. Proposes HE-VLA (VLA + RL balance) pipeline.

Upper-body exoskeletons

  • NuExo (IROS 2025): 5.2kg backpack-mounted. Covers 100% of upper limb ROM. Simultaneously captures robot teleop data, upper limb kinematics, first-person video, finger motion, and force feedback.
  • AirExo: $300/arm, 3D-printed, open-source. Specialized for “in-the-wild” data collection.

Between 2024–2026, exoskeleton costs plummeted from tens of thousands to $300–$500, with HOMIE, AirExo, and HumanoidExo all released as open-source.

Glove-based

Glove interfaces combining finger tracking with haptic feedback, used for dexterous robot teleoperation and imitation learning data collection.

| Product | Sensing | DoF | Haptic | Price | Features |
| --- | --- | --- | --- | --- | --- |
| HaptX G1 | Magnetic MoCap | Full | Pneumatic 135-point + 40 lb force | $5,495+ | ROS 1/2 support, best tactile fidelity |
| SenseGlove R1 | Exo encoders | Full + force | Active force feedback, 1,000 Hz | Undisclosed | Purpose-built for humanoid teleop (shipping Jan 2026) |
| DOGlove | Custom joints | 21 | Cable-driven 5-DoF force + 5 LRAs | $600 | Open-source, LEAP Hand validated (RSS 2025) |
| GEX EX12 | Exo encoders | 12 | Electromechanical force | $600 | 3D-printed, $1,200 with GX11 hand set |
| MANUS Metagloves | EMF | 25 | Optional | $8,879+ | No drift, precision MoCap |
| Dexmo | 11 rotary sensors | 11 | 5 servo motors | Enterprise | 290g, wireless |

SenseGlove R1 is the first commercial glove designed specifically for humanoid teleoperation and imitation learning data collection, with initial shipments beginning January 2026.


Whole-Body Teleoperation

Unitree Embodied Avatar — An operator wearing a motion capture suit (23–43 trackers) mirrors a G1 humanoid in real-time. Tasks include soccer, martial arts, dishwashing, and laundry folding. Collected data feeds UnifoLM model training.

The rise of humanoid robots has introduced a new dimension to teleoperation: simultaneously controlling locomotion + upper-body manipulation + dexterous hands. These subsystems are physically coupled — extending arms shifts the center of mass, and walking destabilizes the upper body.

Balance Control: Who Maintains Equilibrium?

The core challenge of whole-body teleop is “preventing the robot from falling when the operator reaches forward.” Three main approaches exist today.

Decoupled: Lower body runs an autonomous RL locomotion policy while only the upper body is teleoperated. The operator focuses solely on manipulation, but coordinated whole-body motions like bending to pick objects off the floor are not possible.

  • Mobile-TeleVision: Uses a CVAE to predict upper-body motion and feed it to the lower-body locomotion policy, maintaining walking stability. Validated on Unitree H1.
  • HOMIE: A $500 cockpit — pedals for locomotion, isomorphic exoskeleton arms for upper body, sensing gloves for hands. Halves task completion time vs. prior systems, 70%+ imitation learning success. Open-source.
  • Unitree xr_teleoperate: Apple Vision Pro/Meta Quest for upper body, R3 gamepad for walking. Unitree’s built-in RL locomotion policy maintains balance.

Unified: A single RL policy controls the entire body while tracking teleop targets. Enables coordinated whole-body motions, but trades off between manipulation precision and stability.

  • HumanPlus (Stanford): Tracks full body from a single RGB camera. RL policy trained on 40 hours of MoCap data implicitly maintains balance. Demonstrated shoe-wearing, sweater folding on a custom 33-DOF humanoid.
  • TWIST (CoRL 2025): Single neural network handles all whole-body skills — manipulation, walking, dancing. Validated on Unitree G1. Open-source.

Slow-Fast Dual Frequency: Separates lower body (50Hz) and upper body (100Hz) into agents at different frequencies.

  • SoFTA (CMU): Unitree G1 carries water without spilling — end-effector acceleration <2 m/s² (near human-level stability). 50–80% reduction in hand acceleration.

Motion Retargeting: Mapping Different Body Proportions

Mapping a 180cm human’s motion to a 130cm Unitree G1 requires compensating for differences in arm length, leg length, and shoulder width.

  • IK-based mapping: Matches end-effector positions rather than joint angles. Most common approach. (Mobile-TeleVision, Open-TeleVision)
  • SMPL-based body fitting: Optimizes human body model (SMPL) parameters to match robot kinematics, then transforms motion. (H2O)
  • RL implicit learning: RL policies naturally learn retargeting during large-scale MoCap data training. (HumanPlus, OmniH2O)
  • GMR (General Motion Retargeting): Real-time universal retargeting on CPU across diverse humanoids. Used by TWIST. (GMR, ICRA 2026)
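As a toy illustration of the IK-based approach, the sketch below scales a human wrist position by an arm-length ratio and then solves analytic inverse kinematics for a planar 2-link arm. The link lengths and the 2-D simplification are assumptions for illustration; real systems solve full 6-D IK numerically:

```python
import math

def retarget_position(p_human, human_arm=0.75, robot_arm=0.54):
    """Scale a human wrist position into the robot's reach (simple ratio).

    This matches end-effector position rather than joint angles, which is
    the core idea of IK-based retargeting."""
    s = robot_arm / human_arm
    return [s * c for c in p_human]

def two_link_ik(x, y, l1=0.3, l2=0.24):
    """Analytic IK for a planar 2-link arm: returns (shoulder, elbow) angles."""
    r2 = x * x + y * y
    c2 = (r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # clamp for numerical safety at reach limits
    q2 = math.acos(c2)            # elbow angle (elbow-down solution)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

# A human reach to (0.5, 0.2) m becomes a robot target, then joint angles.
target = retarget_position([0.5, 0.2])
q1, q2 = two_link_ik(*target)
```

Checking the result with forward kinematics (`l1*cos(q1) + l2*cos(q1+q2)` etc.) recovers the scaled target, which is exactly the property IK-based mapping preserves.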

Dexterous Hand Teleoperation

Human hands have ~27 DOF while robot hands range from 6–24 DOF, with entirely different kinematic structures. Thumb opposition mechanics, joint ranges, and finger length ratios all differ, making simple joint angle copying impossible. The key principle is “preserving fingertip positions and contact states, not joint angles.”

Tracking Methods

| Method | Accuracy (RMSE) | Latency | Cost | Representative |
| --- | --- | --- | --- | --- |
| Vision-based | ~22.5° (MediaPipe), moderate (AVP) | Low–Med | Free–$3,500 | MediaPipe, Apple Vision Pro, Leap Motion |
| Glove-based | <10° | Very Low | $600–$18K | DOGlove ($600), SenseGlove R1 (1,000Hz), HaptX G1 |
| Exoskeleton | High (direct mapping) | Very Low | Custom | DexUMI, DEXOP, GEX |

Vision-based is the most accessible but suffers from self-occlusion issues. Gloves are accurate but require per-user calibration. Apple Vision Pro tracks 26 joints but experiences 100–200ms latency when fingers occlude each other. EMF-based gloves (MANUS Metagloves) have no drift, while IMU-based sensors drift ~6.6°/hour.

Retargeting Algorithms

Retargeting algorithms are rapidly evolving:

  • Geometric optimization (AnyTeleop, DexPilot): Minimizes vector differences between corresponding keypoints on human and robot hands. Solved via SQP at ~100Hz.
  • Contact-aware retargeting (DexFlow): Preserves contact states (which fingers are touching) rather than geometric shape. Critical for grasping tasks.
  • Learning-based (GeoRT): Per-finger MLP networks for direct mapping. 1KHz inference (10x faster than optimization), trained unsupervised from 5 minutes of human MoCap + robot space sampling. Validated on Allegro and LEAP Hand.
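To make the geometric-optimization idea concrete, here is a deliberately tiny version: a 1-DoF toy finger model and a brute-force search over the curl angle minimizing the scaled keypoint error. Real systems solve a joint SQP over all fingers at ~100 Hz; the forward model, the scale `alpha`, and the grid search here are all illustrative simplifications:

```python
import math

def fingertip(q, base=(0.0, 0.0), l=0.05):
    """Toy forward model: fingertip position of a 1-DoF finger at curl angle q."""
    return (base[0] + l * math.cos(q), base[1] + l * math.sin(q))

def retarget_finger(human_tip, alpha=0.8, steps=2001):
    """Grid-search q minimizing || alpha * human_tip - fingertip(q) ||^2.

    This is the keypoint-vector objective of DexPilot/AnyTeleop-style
    retargeting, shrunk to one finger and one degree of freedom."""
    target = (alpha * human_tip[0], alpha * human_tip[1])
    best_q, best_e = 0.0, float("inf")
    for i in range(steps):
        q = -math.pi + 2 * math.pi * i / (steps - 1)
        x, y = fingertip(q)
        e = (x - target[0]) ** 2 + (y - target[1]) ** 2
        if e < best_e:
            best_q, best_e = q, e
    return best_q

q = retarget_finger((0.05, 0.05))  # human fingertip keypoint, meters
```

With the target along the 45° direction, the optimizer curls the toy finger to roughly π/4, i.e. it reproduces the direction of the human fingertip vector rather than any human joint angle.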

DexUMI: A Paradigm Shift

DexUMI (Stanford, CoRL 2025 Best Paper Finalist) bypasses the retargeting problem entirely:

  1. Kinematics gap: A wearable exoskeleton is optimized to match the target robot hand’s kinematics. Natural human manipulation directly generates robot joint commands.
  2. Visual gap: SAM2 removes human hand + exoskeleton → ProPainter restores background → robot hand imagery composited. Training data looks like the robot captured it.

Results: 86% average task success rate, 3.2x data collection efficiency over traditional teleop. Validated on both 6-DOF underactuated (Inspire Hand) and 12-DOF fully actuated (XHand).

Similar approaches include DEXOP (2025), which uses a passive exoskeleton for in-the-wild data collection (2.4x faster than traditional teleop), and GEX (2025), offering a $1,200 3D-printed robot hand (GX11, 11 DOF) + exoskeleton glove (EX12, 12 DOF) set.

Representative Robot Hands

| Robot Hand | DOF | Price | Teleop Method / Notes |
| --- | --- | --- | --- |
| Shadow Dexterous Hand | 24 | $100K+ | CyberGlove, vision-based neural networks, AnyTeleop |
| Allegro Hand | 16 | $15K–$20K | DexPilot, AnyTeleop, OpenTeach (Meta Quest 3) |
| LEAP Hand (CMU) | 16 | $2K–$3K | 3D-printed, 4-hour assembly, outperforms hands costing 10–100x more |
| Inspire Hand | 6 | Low-cost | DexUMI exoskeleton, Unitree G1 compatible |

Haptic Feedback

Why It Matters

Operators who cannot feel force apply excessive force, crush fragile objects, or incorrectly insert connectors. This directly affects training data quality.

An IEEE Transactions on Haptics (2024) study found that even simple vibrotactile feedback improved data quality by 20% and imitation policy performance by 11% overall — 24% on difficult tasks. (Leveraging Haptic Feedback to Improve Data Quality and Quantity for Deep Imitation Learning Models)

In medicine, Intuitive Surgical’s da Vinci 5 (2024) became the first FDA-approved surgical robot with tactile feedback, reducing peak tissue forces by 43%.

Equipment Spectrum

| Type | Representative | Price | Feedback Level |
| --- | --- | --- | --- |
| Vibrotactile | Oculus controller, bHaptics TactSuit | $300–$500 | Indirect, high-frequency contact detection |
| Grounded haptic | Force Dimension sigma.7, Haply Inverse3 | $10K–$100K | Precise force feedback |
| Pneumatic gloves | HaptX G1 | $5,495+ | 135 points, 40 lb force, best tactile fidelity, ROS 1/2 |
| Force feedback gloves | SenseGlove R1 | Undisclosed | Active force feedback, 1,000 Hz, purpose-built for humanoid teleop |
| EM brake gloves | SenseGlove Nova 2 | $5,999 | 20 N per finger, palm pressure |
| Cable-driven gloves | DOGlove | $600 | 5-DoF force feedback + 5 LRAs, open-source |
| Passive exoskeleton | DexUMI, DEXOP | Custom | Natural tactile via direct object contact, no actuators needed |
| Pseudo-haptics | Visual manipulation (C/D ratio), vibrotactile substitution | No extra cost | Visual illusion simulates weight perception |

ALOHA provides only passive mechanical feedback through backdrivable motors — no active force feedback. However, IGBT (Input-Gated Bilateral Teleoperation) (2025) demonstrated adding bilateral force feedback to ALOHA hardware without force sensors.

Latency and Stability

In bilateral haptic teleoperation, communication delay doesn’t just cause discomfort — it causes system instability. Delayed force transmission creates positive feedback loops that can inject energy into the system.

  • <50ms: Safe for most bilateral control architectures
  • 50–300ms: Stabilization techniques required (Wave Variables, TDPA)
  • >500ms (Earth-Moon): Only model-based predictive control is feasible
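The stabilization techniques in the 50–300ms regime rest on passivity arguments. In the classic wave-variable formulation (Niemeyer and Slotine), velocity and force are encoded before transmission:

```latex
u(t) = \frac{b\,\dot{x}(t) + F(t)}{\sqrt{2b}}, \qquad
w(t) = \frac{b\,\dot{x}(t) - F(t)}{\sqrt{2b}}
```

where $b > 0$ is the wave impedance. Transmitting $(u, w)$ instead of $(\dot{x}, F)$ keeps the communication channel passive for any constant delay, so the delayed loop cannot inject energy, which is why wave variables (and the related TDPA) remain stable where raw bilateral force reflection does not.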

This can be mitigated by Shared Autonomy — the human specifies intent/goals while the robot automatically handles collision avoidance and alignment.


In-Simulation Teleoperation

Why Teleoperate in Simulation

In-simulation teleoperation enables data collection without physical robots and makes it possible to amplify the collected data exponentially.

  • No hardware wear: 24/7 collection, no maintenance or part replacement needed
  • Safety: Dangerous motions have no consequences
  • Automatic domain randomization: Lighting, textures, object poses, friction coefficients varied automatically
  • Synthetic data amplification: A handful of human demonstrations multiplied by orders of magnitude
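The domain-randomization pass mentioned above boils down to sampling scene and physics parameters per episode. A sketch; the parameter names and ranges below are illustrative placeholders, not values from any cited system:

```python
import random

def randomize_domain(rng):
    """Sample one randomized variant of the scene and physics parameters.

    Each collected demonstration can be replayed or regenerated under many
    such variants, which is the basis of synthetic amplification."""
    return {
        "light_intensity": rng.uniform(0.4, 1.6),   # relative brightness
        "table_texture":   rng.choice(["wood", "marble", "plastic"]),
        "object_xy":       (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),
        "friction":        rng.uniform(0.5, 1.2),
        "object_mass_kg":  rng.uniform(0.05, 0.5),
        "control_delay_s": rng.uniform(0.0, 0.05),
    }

rng = random.Random(0)  # seeded for reproducible variant generation
variants = [randomize_domain(rng) for _ in range(1000)]
```

A single teleop demonstration replayed under a thousand such variants is, in miniature, what pipelines like GR00T-Mimic and MimicGen do at scale.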

NVIDIA GR00T Blueprint

NVIDIA’s Isaac GR00T Blueprint provides a complete in-simulation teleop pipeline:

  1. GR00T-Teleop: Operator wears Apple Vision Pro and controls a digital twin robot in Isaac Lab via CloudXR
  2. GR00T-Mimic: Generates massive synthetic trajectories by varying object positions, masses, friction from collected demonstrations. 780,000 trajectories in 11 hours (equivalent to 9 months / 6,500 hours of real collection)
  3. GR00T-Gen: NVIDIA Cosmos Transfer converts simulation renders to photorealistic imagery for visual diversity

Combining synthetic and real data improves GR00T N1 model performance by 40%.

Other Simulation Platforms

  • RoboCasa + MimicGen: MuJoCo-based, 365 household tasks. 50 demonstrations per task via 3D SpaceMouse → MimicGen auto-amplifies to 3,000 trajectories. Latest RoboCasa365 includes 2,200+ hours of data.
  • MuJoCo Playground: Includes ALOHA bimanual environment, policy training in minutes on a single GPU. RSS 2025 Outstanding Demo Award.
  • Genesis: Universal physics engine, 43 million FPS on a single RTX 4090 (430,000x real-time). Multi-solver: rigid bodies, fluids, deformables.
  • AGIBOT Genie Sim 3.0: Open-source, PICO VR teleop, 10,000+ hours synthetic data, LLM-driven scene generation. Unveiled at CES 2026.

Sim-to-Real Gap

The biggest challenge with simulation data is the gap with reality. Current solutions:

  • Domain randomization: Vary visual (lighting, textures) + physics (friction, mass, control delays) during training
  • NVIDIA Cosmos Transfer: Generative model converting simulation renders to photorealistic imagery
  • TRANSIC (Stanford, CoRL 2025): Deploy sim policy, human provides real-time corrections to learn residual policy. 77% success rate (vs. 18% best baseline).

Limitations and Challenges

  • High labor costs: Tesla pays $48/hr to operators in 3 shifts. Large-scale collection means large-scale labor costs.
  • Slow collection speed: A scaling bottleneck. Non-Teleop approaches and simulation amplification are being researched to compensate.
  • Human-robot morphology mismatch: Retargeting errors are inevitable. Thumb kinematics are the hardest to map.
  • Latency: Bilateral haptics become unstable in remote environments. Shared Autonomy or predictive displays needed.
  • “Success-only saving” bias: Commonly, failure data is discarded, but this causes policy collapse in OOD situations. Failures, corrections, and hesitations should be included for robust learning.
  • Calibration/synchronization: Multi-camera, wrist camera, depth, and state timestamps must align — otherwise it’s noise injection, not learning. DROID addressed camera calibration improvements in a separate update.
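A minimal version of that synchronization step, assuming nearest-neighbor timestamp matching with a drop tolerance (the 10 ms tolerance and the sample rates below are illustrative):

```python
import bisect

def align(cam_ts, state_ts, tolerance=0.010):
    """Pair each camera timestamp with the nearest robot-state timestamp.

    Pairs farther apart than `tolerance` seconds are dropped rather than
    fed to training as mismatched (observation, action) noise."""
    pairs = []
    for t in cam_ts:
        i = bisect.bisect_left(state_ts, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(state_ts)]
        j = min(candidates, key=lambda k: abs(state_ts[k] - t))
        if abs(state_ts[j] - t) <= tolerance:
            pairs.append((t, state_ts[j]))
    return pairs

cam = [0.000, 0.033, 0.066, 0.100]                   # ~30 Hz camera
state = [0.000, 0.020, 0.040, 0.060, 0.080, 0.100]   # 50 Hz robot states
pairs = align(cam, state)
```

Hardware-triggered capture or PTP clock sync makes this trivial; the software fallback above is what remains when streams free-run.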

Key Examples

Stanford ALOHA / ALOHA 2

ALOHA is an open-source bimanual teleoperation system under $20K that, together with the ACT (Action Chunking with Transformers) algorithm, democratized robot learning research. ALOHA 2 (Google DeepMind) reduced gripper operation force from 14.68N to 0.84N (10x improvement) and doubled max gripping force from 12.8N to 27.9N. A precise MuJoCo model is available in MuJoCo Menagerie. ALOHA Unleashed demonstrated complex tasks like tying shoelaces and hanging shirts.

Tesla Optimus

Tesla operates 50+ operators at $48/hr in 3 shifts collecting Optimus training data. Initially using motion capture suits, they transitioned to Vision-Only in mid-2025. Operators wear helmets with 5 cameras and repeat daily actions (wiping tables, lifting cups, pulling curtains). This eliminates equipment bottlenecks for workforce scaling, though concerns about missing tactile information remain.

Physical Intelligence (pi0)

Physical Intelligence’s pi0 is a VLA flow model trained on teleop data from 7 robot platforms across 68 tasks. It supports DROID (Franka), ALOHA (low-cost bimanual), Bimanual Trossen, Bimanual ARX, and more. New tasks can be fine-tuned with just 1–20 hours of teleop data. pi0.5 achieved open-world generalization, cleaning up entirely new kitchens and bedrooms.

Google DeepMind (Open X-Embodiment)

Open X-Embodiment is the largest robot dataset ever assembled: 34 labs, 22 robot embodiments, 1M+ trajectories. It encompasses 500+ skills and 150,000+ tasks, serving as the training foundation for RT-1 and RT-2 models.

1X Technologies (NEO)

NEO is a home humanoid robot priced at $20,000, shipping in 2026. Its unique strategy: deploy robots in consumer homes, and when the robot encounters unknown tasks, a remote operator performs them via VR “Expert Mode.” During these sessions, NEO’s AI observes and learns, gradually transitioning to autonomous operation. Over 10,000+ hours already collected. CEO Bernt Bornich: “If we don’t have your data, we can’t make the product better.”

NVIDIA GR00T

GR00T N1 (2B parameter VLA) was pretrained with ~50,000 H100 GPU hours. It provides the complete pipeline: Apple Vision Pro in-sim teleop → GR00T-Mimic synthetic amplification → Cosmos Transfer visual diversity. Internal teleop data expanded from 88 hours to 827 hours (10x amplification). GR00T N1.6 trained on thousands of hours of diverse teleop data.

Open-TeleVision

Open-TeleVision (UCSD/MIT, CoRL 2024) is an immersive teleoperation system using Apple Vision Pro. An active stereo camera on the robot head tracks the operator’s head movements, with the full loop running at 60Hz. MIT’s Ge Yang remotely operated a Unitree H1 at UC San Diego from the East Coast — demonstrating cross-continent teleoperation. Open-source.

Unitree Embodied Avatar

Unitree’s Embodied Avatar combines a motion capture suit (23–43 trackers) with 5G edge computing for full-body teleoperation. The G1 robot (23–43 DOF, up to 120 N·m joint torque) mirrors operator movements with millisecond-level latency. Tasks include soccer, martial arts, dishwashing, and laundry folding. Collected data feeds UnifoLM-VLA (general-purpose manipulation VLA) and UnifoLM-WMA (world-model-based policy) training. Beta keys for labs and businesses were distributed from November 2025, with a planned multi-robot mode allowing one operator to control up to 5 G1s simultaneously.

HOMIE

HOMIE (RSS 2025) is a ~$500 cockpit system: pedals (locomotion) + isomorphic exoskeleton arms (7-DOF, DYNAMIXEL servos) + Hall-sensor sensing gloves (15+ DOF) for whole-body teleop. Halves task completion time, 70%+ imitation learning success. Open-source.


From Data to Autonomy: Learning Pipeline

The path from teleoperation data to autonomous behavior:

Teleop demonstrations → Behavior Cloning (BC) → Deployment/failures → DAgger correction → VLA Foundation Model → Self-Improvement

  • Behavior Cloning: Supervised learning on teleop (observation, action) pairs. Quick baseline policy. However, suffers from covariate shift — errors compound in unseen states.
  • DAgger: When an imperfect policy fails during execution, the operator intervenes to collect correction data. The robot learns “how to recover from mistakes.”
  • VLA Foundation Model: Integrated vision + language + action learning at scale. GR00T N1, pi0, Helix, etc.
  • Self-Improvement: Trained models practice autonomously. Steps-to-go prediction extracts intrinsic reward functions for reinforcement learning without human supervision.
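The Behavior Cloning step above is plain supervised learning on (observation, action) pairs. The sketch below fits a linear policy with per-sample gradient descent purely to show that structure; real pipelines use ACT, diffusion policies, or VLA models, and the synthetic "teleoperator" here is an assumption for the demo:

```python
import random

def behavior_clone(demos, dim, lr=0.1, epochs=200):
    """Fit a linear policy a = w . o by least squares on (obs, action) pairs.

    Each pass minimizes (prediction - demonstrated action)^2, which is all
    Behavior Cloning is; the model class is what varies in practice."""
    w = [0.0] * dim
    for _ in range(epochs):
        for obs, act in demos:
            pred = sum(wi * oi for wi, oi in zip(w, obs))
            err = pred - act
            for i in range(dim):
                w[i] -= lr * err * obs[i]
    return w

# Synthetic demos from a "teleoperator" whose hidden policy is a = 2*o0 - o1.
rng = random.Random(0)
demos = []
for _ in range(50):
    o = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    demos.append((o, 2 * o[0] - 1 * o[1]))

w = behavior_clone(demos, dim=2)
```

The fit recovers the demonstrator's policy on its own state distribution; covariate shift is precisely what happens when the deployed policy drifts off that distribution, which is the failure DAgger corrections address.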

Scaling strategies branch into three paths:

  • Real + synthetic amplification: NVIDIA GR00T (11 hours → 780K trajectories), RoboCasa + MimicGen (50 → 3,000 trajectories)
  • Distributed crowdsourcing: DROID (institution-distributed, Oculus Quest 2), RoboTurk (smartphone-based), BridgeData V2 (60K+ trajectories)
  • Consumer deployment: 1X NEO (home robot deployment, VR Expert Mode)

See Also
