4D Human Body Capture from Egocentric Video via 3D Scene Grounding

We introduce a novel task of reconstructing a time series of second-person1 3D human body meshes from monocular egocentric videos. The unique viewpoint and rapid embodied camera motion of egocentric videos raise additional technical barriers for human body capture. To address those challenges, we propose a simple yet effective optimization-based approach that leverages 2D observations of the entire video sequence and human-scene interaction constraint to estimate second-person human poses, shapes, and global motion that are grounded on the 3D environment captured from the egocentric view. We conduct detailed ablation studies to validate our design choice. Moreover, we compare our method with the previous state-of-the-art method on human motion capture from monocular video, and show that our method estimates more accurate human-body poses and shapes under the challenging egocentric setting. In addition, we demonstrate that our approach produces more realistic human-scene interaction.

[1]  James M. Rehg,et al.  Ego4D: Around the World in 3,000 Hours of Egocentric Video , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[3]  James M. Rehg,et al.  Egocentric Activity Recognition and Localization on a 3D Map , 2021, ArXiv.

[4]  Torsten Sattler,et al.  Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  James M. Rehg,et al.  In the Eye of the Beholder: Gaze and Actions in First Person Video , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Christian Theobalt,et al.  PhysCap , 2020, ACM Trans. Graph..

[7]  David F. Fouhey,et al.  Full-Body Awareness from Partial Observations , 2020, ECCV.

[8]  Michael J. Black,et al.  Generating Person-Scene Interactions in 3D Scenes , 2020, ArXiv.

[9]  Deva Ramanan,et al.  Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild , 2020, ECCV.

[10]  A. Tyagi,et al.  PoseNet3D: Learning Temporally Consistent 3D Human Pose via Knowledge Distillation , 2020, 2020 International Conference on 3D Vision (3DV).

[11]  Eunji Chong,et al.  Detecting Attended Visual Targets in Video , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  K. Grauman,et al.  Ego-Topo: Environment Affordances From Egocentric Video , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Michael J. Black,et al.  Generating 3D People in Scenes Without People , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  James M. Rehg,et al.  Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video , 2019, ECCV.

[16]  Kristen Grauman,et al.  You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Dimitrios Tzionas,et al.  Resolving 3D Human Pose Ambiguities With 3D Scene Constraints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Lourdes Agapito,et al.  xR-EgoPose: Egocentric 3D Human Pose From an HMD Camera , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Giovanni Maria Farinella,et al.  What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Andrew Zisserman,et al.  Exploiting Temporal Context for 3D Human Pose Estimation in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Nicolas Mansard,et al.  Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Kristen Grauman,et al.  Grounded Human-Object Interaction Hotspots From Video , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Michael J. Black,et al.  Resolving 3 D Human Pose Ambiguities with 3 D Scene Constraints , 2019 .

[29]  Kris M. Kitani,et al.  3D Ego-Pose Estimation via Imitation Learning , 2018, ECCV.

[30]  James M. Rehg,et al.  In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video , 2018, ECCV.

[31]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[32]  James M. Rehg,et al.  Connecting Gaze, Scene, and Attention: Generalized Attention Estimation via Joint Modeling of Gaze and Scene Saliency , 2018, ECCV.

[33]  Qionghai Dai,et al.  Outdoor Markerless Motion Capture with Sparse Handheld Video Cameras , 2018, IEEE Transactions on Visualization and Computer Graphics.

[34]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[35]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Yoichi Sato,et al.  Future Person Localization in First-Person Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Abhishek Sharma,et al.  Learning 3D Human Pose from Structure and Motion , 2017, ECCV.

[38]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[39]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[40]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[41]  Abhinav Gupta,et al.  Binge Watching: Scaling Affordance Learning from Sitcoms , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43]  Bodo Rosenhahn,et al.  Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs , 2017, Comput. Graph. Forum.

[44]  Bodo Rosenhahn,et al.  Optical Flow-Based 3D Human Motion Estimation from Monocular Video , 2017, GCPR.

[45]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Kristen Grauman,et al.  Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Matthias Nießner,et al.  PiGraphs: learning interaction snapshots from observations , 2016, ACM Trans. Graph..

[48]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[49]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Yoichi Sato,et al.  Recognizing Micro-Actions and Reactions from Paired Egocentric Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Jianbo Shi,et al.  Egocentric Future Localization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Kristen Grauman,et al.  Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance , 2016, International Journal of Computer Vision.

[53]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Shmuel Peleg,et al.  Compact CNN for indexing egocentric videos , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[55]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  James M. Rehg,et al.  Delving into egocentric actions , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Jianbo Shi,et al.  Social saliency prediction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  James M. Rehg,et al.  Detecting bids for eye contact using a wearable camera , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[59]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[60]  Mohan S. Kankanhalli,et al.  Action and Interaction Recognition in First-Person Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[61]  Larry H. Matthies,et al.  First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[62]  Alexei A. Efros,et al.  Scene Semantics from Long-Term Observation of People , 2012, ECCV.

[63]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[65]  Alexei A. Efros,et al.  From 3D scene geometry to human workspace , 2011, CVPR 2011.

[66]  Luc Van Gool,et al.  What makes a chair a chair? , 2011, CVPR 2011.

[67]  Dragomir Anguelov,et al.  SCAPE: shape completion and animation of people , 2005, ACM Trans. Graph..

[68]  Stuart Geman,et al.  Statistical methods for tomographic image reconstruction , 1987 .