Paper Information - Ego-Body Pose Estimation via Ego-Head Pose Estimation

Ego-Body Pose Estimation via Ego-Head Pose Estimation

Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Subsequently, leveraging the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.
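The two-stage decomposition described in the abstract can be pictured with a toy sketch of the data flow. Everything below is an illustrative assumption rather than the authors' implementation: the function names, the pose dimensions, and the stand-in "denoiser" are hypothetical. Stage 1 is a placeholder for the SLAM plus learned head-pose estimator, and stage 2 is a placeholder for the conditional diffusion model. The only point it makes is that the body sampler sees the head trajectory and never the video, and that repeated sampling gives several plausible motions for one head trajectory.

```python
import random
from typing import List

# Hypothetical toy types: one small vector per frame.
HeadTrajectory = List[List[float]]
BodyMotion = List[List[float]]


def estimate_head_poses(num_frames: int) -> HeadTrajectory:
    """Stage 1 placeholder (assumption): stands in for the SLAM + learned estimator.

    Returns a dummy head trajectory with 3 values per frame.
    """
    return [[0.1 * t, 0.0, 1.6] for t in range(num_frames)]


def sample_body_motion(
    head_poses: HeadTrajectory,
    rng: random.Random,
    body_dim: int = 6,
    steps: int = 10,
) -> BodyMotion:
    """Stage 2 placeholder (assumption): iterative denoising conditioned on head poses.

    The update rule is a toy stand-in, not a trained diffusion network.
    """
    # Start every frame from Gaussian noise.
    motion = [[rng.gauss(0.0, 1.0) for _ in range(body_dim)] for _ in head_poses]
    for step in range(steps, 0, -1):
        noise_scale = (step - 1) / steps  # no noise is injected at the final step
        for frame, head in zip(motion, head_poses):
            for j in range(body_dim):
                target = head[j % len(head)]  # toy conditioning on the head pose
                frame[j] = 0.5 * (frame[j] + target) + 0.1 * noise_scale * rng.gauss(0.0, 1.0)
    return motion


def ego_body_from_head(num_frames: int, num_samples: int, seed: int = 0):
    """Run stage 1 once, then draw several body motions for the same head trajectory."""
    head = estimate_head_poses(num_frames)
    rng = random.Random(seed)
    samples = [sample_body_motion(head, rng) for _ in range(num_samples)]
    return head, samples
```

Because the two stages communicate only through the head trajectory, each placeholder could in principle be trained or replaced independently, which is the property the abstract relies on to avoid paired video and motion data.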

C. K. Liu | Jiajun Wu | Jiaman Li
