Ego-Body Pose Estimation via Ego-Head Pose Estimation

Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping from egocentric videos to human motions is challenging, because the user's body is often unobserved by the front-facing camera mounted on the user's head. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric videos and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages connected by head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Then, taking the estimated head pose as input, EgoEgo uses conditional diffusion to generate multiple plausible full-body motions. Because the two stages are trained independently, we can leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motions. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.
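
The second stage, generating several plausible body motions from one head pose sequence by conditional diffusion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a toy setting in which body motion given head pose is Gaussian, so the ideal noise predictor has a closed form and stands in for a trained network. The names `head_to_body_mean`, `analytic_denoiser`, and `sample_body_motions`, the dimensions, and the noise schedule are all hypothetical choices made for illustration.

```python
import numpy as np


def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.05):
    """Linear DDPM noise schedule (the values are illustrative assumptions)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def head_to_body_mean(head_pose, body_dim=6):
    """Hypothetical conditioning map: a fixed linear projection of head pose."""
    head_dim = head_pose.shape[-1]
    proj = np.sin(np.arange(head_dim * body_dim)).reshape(head_dim, body_dim)
    return head_pose @ (proj / head_dim)


def analytic_denoiser(x_t, alpha_bar_t, cond_mean, sigma):
    """Stand-in for a trained noise-prediction network.

    Toy assumption: x_0 | head pose ~ N(cond_mean, sigma^2 I). Under that
    assumption E[eps | x_t] is available in closed form.
    """
    var_t = alpha_bar_t * sigma**2 + (1.0 - alpha_bar_t)
    return np.sqrt(1.0 - alpha_bar_t) * (x_t - np.sqrt(alpha_bar_t) * cond_mean) / var_t


def sample_body_motions(head_pose, sigma=0.5, num_samples=3, num_steps=200, seed=0):
    """DDPM ancestral sampling of body motions conditioned on head pose."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    cond_mean = head_to_body_mean(head_pose)          # (frames, body_dim)
    # Start every sample from pure Gaussian noise.
    x = rng.standard_normal((num_samples,) + cond_mean.shape)
    for t in reversed(range(num_steps)):
        eps = analytic_denoiser(x, alpha_bars[t], cond_mean, sigma)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    # Placeholder head pose: 4 frames, 9 values per frame (an assumed layout).
    head_pose = np.linspace(0.0, 1.0, 36).reshape(4, 9)
    motions = sample_body_motions(head_pose)
    print(motions.shape)
```

Each row of the returned array is one candidate body motion for the same head pose sequence; different rows differ because each starts from independent noise. In a real system the closed-form denoiser would be replaced by a learned network, and the head pose would come from the first stage rather than a placeholder array.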
