H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions

We present a unified framework for understanding 3D hand-object interactions in raw image sequences from egocentric RGB cameras. Given a single RGB image, our model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action classes in a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but is instead trained end-to-end on single images. We further merge and propagate information in the temporal domain to infer interactions between hand and object trajectories and to recognize actions. The complete model takes a sequence of frames as input and outputs per-frame 3D hand and object pose predictions, along with estimates of the object and action categories for the entire sequence. We demonstrate state-of-the-art performance even in comparison to approaches that rely on depth data and ground-truth annotations.
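To make the described input/output interface concrete, the sketch below shows one plausible shape for the per-frame predictions and the sequence-level aggregation. All names, array shapes, and the mean-pooling step are our own illustrative assumptions; the paper's actual model propagates temporal information through a learned recurrent module rather than simple pooling.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class FramePrediction:
    """Hypothetical per-frame outputs of the single feed-forward pass."""
    hand_pose: np.ndarray      # (21, 3) 3D hand joint positions (assumed layout)
    object_pose: np.ndarray    # (8, 3) 3D object bounding-box corners (assumed)
    action_logits: np.ndarray  # (num_actions,) per-frame action scores
    object_logits: np.ndarray  # (num_objects,) per-frame object scores


def aggregate_sequence(frames: List[FramePrediction]) -> Tuple[int, int]:
    """Sequence-level action and object labels from per-frame predictions.

    Averages per-frame logits and takes the argmax. This mean-pooling
    stand-in only illustrates the interface; the paper merges temporal
    information with a learned recurrent model instead.
    """
    action = int(np.mean([f.action_logits for f in frames], axis=0).argmax())
    obj = int(np.mean([f.object_logits for f in frames], axis=0).argmax())
    return action, obj
```

Note that the per-frame pose predictions are kept as-is in this scheme; only the categorical outputs are aggregated over the sequence.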
