SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control¶
Authors: Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castaneda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu
Published: 2025 (Preprint)
Source: arXiv
Algorithm: SONIC
arXiv: 2511.07820
Summary¶
SONIC argues that motion tracking is a scalable pretraining task for humanoid whole-body control, then demonstrates gains from increasing controller size, motion-capture data volume, and training compute. The system is useful beyond tracking itself: a universal kinematic planner and unified token space let the same policy interface with VR teleoperation, human video, and VLA-style inputs, making motion priors a foundation layer for natural humanoid behavior.
Abstract¶
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
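The abstract frames motion tracking as dense supervision against motion-capture references. A common way to turn such dense supervision into a reward is an exponentiated negative tracking error over reference poses (DeepMimic-style shaping); the paper does not specify its exact reward terms here, so the function below, including its names and the `sigma` scale, is an illustrative assumption, not SONIC's implementation:

```python
import numpy as np

def tracking_reward(qpos: np.ndarray, qpos_ref: np.ndarray, sigma: float = 0.5) -> float:
    """Illustrative motion-tracking reward (assumed, not from the paper).

    qpos:     current joint positions of the humanoid.
    qpos_ref: reference joint positions from the motion-capture frame.
    Returns a reward in (0, 1]; 1.0 means perfect tracking, and the
    reward decays exponentially as the squared pose error grows.
    """
    err = float(np.sum((qpos - qpos_ref) ** 2))
    return float(np.exp(-err / sigma ** 2))
```

Because every mocap frame supplies a reference pose, this kind of objective needs no manual reward engineering per behavior, which is what makes tracking a scalable pretraining signal.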
Links¶
- arXiv abstract: https://arxiv.org/abs/2511.07820
Tags¶
- Robotics
- Humanoid robots
- Whole-body control
- Motion tracking
- Robot foundation models
- Reinforcement learning
- Motion capture
- Teleoperation
- Sim-to-real transfer
- Embodied AI