Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

We address the challenging task of anticipating human-object interaction in first-person videos. Most existing methods ignore how the camera wearer interacts with objects, or simply treat body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this, we adopt intentional hand movement as a representation of the future and propose a novel deep network that jointly models and predicts egocentric hand motion, interaction hotspots, and future actions. Specifically, we treat future hand motion as motor attention and model this attention using latent variables in our deep model. The predicted motor attention is then used to select discriminative spatio-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at this https URL
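
To make the idea of a latent attention map that selects visual features more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the attention is a categorical distribution over spatial locations, sampled with the Gumbel-softmax relaxation, and that features are pooled by attention-weighted averaging. The function names, the temperature value, and the toy inputs are all hypothetical.

```python
import math
import random


def sample_attention(logits, temperature=1.0, rng=None):
    """Draw a soft attention map over locations (Gumbel-softmax relaxation).

    Assumption: attention is categorical over locations; the abstract only
    says it is modelled with latent variables.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, attention):
    """Pool per-location feature vectors using the attention weights."""
    dim = len(features[0])
    return [
        sum(a * f[d] for a, f in zip(attention, features))
        for d in range(dim)
    ]


# Toy example: 4 spatial locations, 2-dimensional features (hypothetical values).
rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 0.5]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
attention = sample_attention(logits, temperature=0.5, rng=rng)
pooled = attend(features, attention)
```

In this sketch, a lower temperature pushes the sampled map closer to a one-hot selection of a single location, while a higher temperature spreads the weight more evenly.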
