You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions

The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer's 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person---whose body pose we can directly observe---as a signal inherently linked to the body pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view. We demonstrate our idea on a variety of domains with dyadic interaction and show the substantial impact on egocentric body pose estimation, which improves the state of the art.

[1]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Jianbo Shi,et al.  Social saliency prediction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Luc Van Gool,et al.  Two-Stream SR-CNNs for Action Recognition in Videos , 2016, BMVC.

[4]  Kris M. Kitani,et al.  Action-Reaction: Forecasting the Dynamics of Human Interaction , 2014, ECCV.

[5]  Kris Kitani,et al.  Ego-Pose Estimation and Forecasting As Real-Time PD Control , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Cordelia Schmid,et al.  LCR-Net: Localization-Classification-Regression for Human Pose , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Cordelia Schmid,et al.  P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Francesco Solera,et al.  From Ego to Nos-Vision: Detecting Social Relationships in First-Person Views , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[10]  Yaser Sheikh,et al.  Motion capture from body-mounted cameras , 2011, ACM Trans. Graph..

[11]  Luc Van Gool,et al.  Improving Data Association by Joint Modeling of Pedestrian Trajectories and Groupings , 2010, ECCV.

[12]  Alexandros Stergiou,et al.  Understanding human-human interactions: a survey , 2018, ArXiv.

[13]  James M. Rehg,et al.  Detecting eye contact using wearable eye-tracking glasses , 2012, UbiComp.

[14]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Stefan Lee,et al.  Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Simone Calderara,et al.  Understanding social relationships in egocentric vision , 2015, Pattern Recognit..

[17]  Adrien Treuille,et al.  Continuum crowds , 2006, SIGGRAPH 2006.

[18]  Deva Ramanan,et al.  First-person pose recognition using egocentric workspaces , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Dima Damen,et al.  You-Do, I-Learn: Discovering Task Relevant Objects and their Modes of Interaction from Multi-User Egocentric Video , 2014, BMVC.

[20]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Larry H. Matthies,et al.  First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Kris M. Kitani,et al.  Going Deeper into First-Person Activity Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Greg Mori,et al.  Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[25]  Shmuel Peleg,et al.  Compact CNN for indexing egocentric videos , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[26]  Cheng Li,et al.  Pixel-Level Hand Detection in Ego-centric Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[28]  Louis-Philippe Morency,et al.  Modeling Latent Discriminative Dynamic of Multi-dimensional Affective Signals , 2011, ACII.

[29]  Song-Chun Zhu,et al.  Joint action recognition and pose estimation from video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Martial Hebert,et al.  Temporal segmentation and activity classification from first-person sensing , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[31]  Kris M. Kitani,et al.  3D Ego-Pose Estimation via Imitation Learning , 2018, ECCV.

[32]  Ioannis A. Kakadiaris,et al.  3D Human pose estimation: A review of the literature and analysis of covariates , 2016, Comput. Vis. Image Underst..

[33]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Antonio Origlia,et al.  From Nonverbal Cues to Perception: Personality and Social Attractiveness , 2011, COST 2102 Training School.

[35]  Gang Wang,et al.  Skeleton-Based Online Action Prediction Using Scale Selection Network , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Kristen Grauman,et al.  Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[38]  Larry H. Matthies,et al.  Pooled motion features for first-person videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[40]  Silvio Savarese,et al.  Social Scene Understanding: End-to-End Multi-person Action Localization and Collective Activity Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Shimon Ullman,et al.  Human Pose Estimation Using Deep Consensus Voting , 2016, ECCV.

[42]  Louis-Philippe Morency,et al.  Modeling Human Communication Dynamics [Social Sciences] , 2010, IEEE Signal Processing Magazine.

[43]  Louis-Philippe Morency,et al.  Modeling Human Communication Dynamics , 2010 .

[44]  Yi-Ping Hung,et al.  Recognizing Human Actions with Outlier Frames by Observation Filtering and Completion , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[45]  Alexandros Stergiou,et al.  Analyzing human-human interactions: A survey , 2019, Comput. Vis. Image Underst..

[46]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Jianbo Shi,et al.  Egocentric Future Localization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Omar Ait-Aider,et al.  Rolling Shutter Pose and Ego-Motion Estimation Using Shape-from-Template , 2018, ECCV.

[49]  Fei-Fei Li,et al.  Socially-Aware Large-Scale Crowd Forecasting , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[54]  Hans-Peter Seidel,et al.  EgoCap , 2016, ACM Trans. Graph..

[55]  Kristen Grauman,et al.  Object-Centric Spatio-Temporal Pyramids for Egocentric Activity Recognition , 2013, BMVC.

[56]  Jianbo Shi,et al.  Egocentric Basketball Motion Planning from a Single First-Person Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Cheng Li,et al.  Model Recommendation with Virtual Probes for Egocentric Hand Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[59]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[60]  Yoichi Sato,et al.  Recognizing Micro-Actions and Reactions from Paired Egocentric Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Kaiming He,et al.  Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[64]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[65]  Frank J. Bernieri,et al.  Synchrony, pseudosynchrony, and dissynchrony: Measuring the entrainment process in mother-infant interactions. , 1988 .

[66]  Jiaxuan Wang,et al.  HICO: A Benchmark for Recognizing Human-Object Interactions in Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[67]  Juergen Gall,et al.  PoseTrack: Joint Multi-person Pose Estimation and Tracking , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Lorenzo Torresani,et al.  Detect-and-Track: Efficient Pose Estimation in Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[69]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[70]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[71]  James M. Rehg,et al.  Detecting bids for eye contact using a wearable camera , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[72]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[73]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.