EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices

Understanding social interactions from first-person views is crucial for many applications, ranging from assistive robotics to AR/VR. A first step for reasoning about interactions is to understand human pose and shape. However, research in this area is currently hindered by the lack of data. Existing datasets are limited in terms of either size, annotations, ground-truth capture modalities or the diversity of interactions. We address this shortcoming by proposing EgoBody, a novel large-scale dataset for social interactions in complex 3D scenes. We employ Microsoft HoloLens2 headsets to record rich egocentric data streams (including RGB, depth, eye gaze, head and hand tracking). To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames, reconstructing 3D human poses and shapes relative to the scene. We collect 68 sequences, spanning diverse sociological interaction categories, and propose the first benchmark for 3D full-body pose and shape estimation from egocentric views. Our dataset and code will be available for research at https: //sanweiliti.github.io/egobody/egobody.html.

[1]  Kostas Daniilidis,et al.  Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Interaction Capture , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Michael J. Black,et al.  Markerless Outdoor Human Motion Capture Using Multiple Autonomous Micro Aerial Vehicles , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[5]  Bodo Rosenhahn,et al.  Human Pose Estimation from Video and IMUs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[7]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Zhengyi Luo,et al.  3D Human Motion Estimation via Motion Compression and Refinement , 2020, ACCV.

[9]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Jie Song,et al.  Human Body Model Fitting by Learned Gradient Descent , 2020, ECCV.

[11]  Hujun Bao,et al.  Motion Capture from Internet Videos , 2020, ECCV.

[12]  Lijuan Wang,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Marc Pollefeys,et al.  Learning Motion Priors for 4D Human Body Capture in 3D Scenes , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Mohan S. Kankanhalli,et al.  Action and Interaction Recognition in First-Person Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[15]  James M. Rehg,et al.  In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video , 2018, ECCV.

[16]  Cristian Sminchisescu,et al.  Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows , 2020, ECCV.

[17]  J. Gower Generalized procrustes analysis , 1975 .

[18]  Michael J. Black,et al.  The Naked Truth: Estimating Body Shape Under Clothing , 2008, ECCV.

[19]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[20]  Larry H. Matthies,et al.  First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  The social bond;: An introduction to the study of society , 1970 .

[22]  Lourdes Agapito,et al.  xR-EgoPose: Egocentric 3D Human Pose From an HMD Camera , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Song-Chun Zhu,et al.  DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Jesús Chamorro-Martínez,et al.  Diatom autofocusing in brightfield microscopy: a comparative study , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[25]  Yaser Sheikh,et al.  Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Liu Wu,et al.  Human Mesh Recovery From Monocular Images via a Skeleton-Disentangled Representation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Jihun Yu,et al.  HUMBI: A Large Multiview Dataset of Human Body Expressions , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Iasonas Kokkinos,et al.  HoloPose: Holistic 3D Human Reconstruction In-The-Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Marc Pollefeys,et al.  HoloLens 2 Research Mode as a Tool for Computer Vision Research , 2020, ArXiv.

[30]  Yaser Sheikh,et al.  Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in a Triadic Interaction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ignas Budvytis,et al.  Indirect deep structured learning for 3D human body shape and pose prediction , 2017, BMVC.

[32]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[33]  Cristian Sminchisescu,et al.  Three-Dimensional Reconstruction of Human Interactions , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Dongdong Yu,et al.  Body Meshes as Points , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Tao Yu,et al.  4D Association Graph for Realtime Multi-Person Motion Capture Using Multiple Video Cameras , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Cordelia Schmid,et al.  Actor and Observer: Joint Modeling of First and Third-Person Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Dimitrios Tzionas,et al.  Resolving 3D Human Pose Ambiguities With 3D Scene Constraints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Hans-Peter Seidel,et al.  Optimization and Filtering for Human Motion Capture , 2010, International Journal of Computer Vision.

[41]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Lourdes Agapito,et al.  SelfPose: 3D Egocentric Pose Estimation From a Headset Mounted Camera , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Cewu Lu,et al.  HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation , 2020, ArXiv.

[45]  Michael J. Black,et al.  PARE: Part Attention Regressor for 3D Human Body Estimation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Takaaki Shiratori,et al.  FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[47]  Andrea Vedaldi,et al.  Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[48]  Kristen Grauman,et al.  You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Paul J. Besl,et al.  A Method for Registration of 3-D Shapes , 1992, IEEE Trans. Pattern Anal. Mach. Intell..

[50]  Trevor Darrell,et al.  Inferring 3D structure with a statistical image-based shape model , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[51]  Kyoung Mu Lee,et al.  I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image , 2020, ECCV.

[52]  Bodo Rosenhahn,et al.  CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the Wild , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Torsten Sattler,et al.  Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Qionghai Dai,et al.  Outdoor Markerless Motion Capture with Sparse Handheld Video Cameras , 2018, IEEE Transactions on Visualization and Computer Graphics.

[55]  Joachim Tesch,et al.  AGORA: Avatars in Geography Optimized for Regression Analysis , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[57]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[58]  James M. Rehg,et al.  Detecting bids for eye contact using a wearable camera , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[59]  Kris Kitani,et al.  SimPoE: Simulated Character Control for 3D Human Pose Estimation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[62]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[63]  Kostas Daniilidis,et al.  Probabilistic Modeling for Human Mesh Recovery , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[64]  Christian Theobalt,et al.  Monocular Real-time Full Body Capture with Inter-part Correlations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[66]  Charles Malleson,et al.  Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors , 2017, BMVC.

[67]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Hujun Bao,et al.  Reconstructing 3D Human Pose by Watching Humans in the Mirror , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  James M. Rehg,et al.  Ego4D: Around the World in 3,000 Hours of Egocentric Video , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Kris M. Kitani,et al.  Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose Estimation , 2020, ArXiv.

[72]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[73]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[74]  Yoichi Sato,et al.  Recognizing Micro-Actions and Reactions from Paired Egocentric Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Joachim Tesch,et al.  SPEC: Seeing People in the Wild with an Estimated Camera , 2021, ArXiv.

[76]  Kyoung Mu Lee,et al.  Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose , 2020, ECCV.

[77]  Pascal Fua,et al.  Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera , 2018, IEEE Transactions on Visualization and Computer Graphics.

[78]  Kris Kitani,et al.  Ego-Pose Estimation and Forecasting As Real-Time PD Control , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[79]  Serena Yeung,et al.  Holistic 3D Human and Scene Mesh Estimation from Single View Images , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Marc Pollefeys,et al.  H2O: Two Hands Manipulating Objects for First Person Interaction Recognition , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[81]  Kristen Grauman,et al.  Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  James M. Rehg,et al.  4D Human Body Capture from Egocentric Video via 3D Scene Grounding , 2020, 2021 International Conference on 3D Vision (3DV).

[83]  Kyoung Mu Lee,et al.  Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[84]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Francesc Moreno-Noguer,et al.  Human Motion Prediction via Spatio-Temporal Inpainting , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).