Paper Information - You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions

The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer's 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person (whose body pose we can directly observe) as a signal inherently linked to the body pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view. We demonstrate our idea on a variety of domains with dyadic interaction and show the substantial impact on egocentric body pose estimation, which improves the state of the art.
