Hand-Object Interaction Reasoning

This paper proposes an interaction reasoning network for modelling spatio-temporal relationships between hands and objects in video. The proposed interaction unit utilises a Transformer module to reason about each acting hand and its spatio-temporal relation to the other hand, as well as to the objects being interacted with. We show that modelling two-handed interactions is critical for action recognition in egocentric video, and demonstrate that by using positionally-encoded trajectories, the network can better recognise observed interactions. We evaluate our proposal on the EPIC-KITCHENS and Something-Else datasets, and present an ablation study.
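To make the idea of positionally-encoded trajectories concrete, the following is a minimal NumPy sketch, not the paper's implementation: each hand and object is represented by a token per frame built from its box centre plus a sinusoidal temporal position code (as in "Attention is All you Need"), and a single scaled dot-product self-attention step lets every token attend to all hand and object tokens. The trajectory values, feature dimension, and token construction here are all illustrative assumptions.

```python
import numpy as np

def sinusoidal_encoding(positions, dim=16):
    """Sinusoidal positional codes for scalar time steps (Vaswani et al. style)."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]      # (T, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]                # (1, dim/2)
    angles = positions / (10000.0 ** (2.0 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)   # (T, dim)

def attention(q, k, v):
    """Single-head scaled dot-product attention over a set of entity tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                                # rows sum to 1
    return w @ v

# Hypothetical trajectories: (x, y) box centres for a hand and an object over T frames.
T, dim = 8, 16
hand_xy = np.linspace([0.2, 0.5], [0.6, 0.4], T)
obj_xy = np.linspace([0.7, 0.3], [0.5, 0.4], T)

pe = sinusoidal_encoding(np.arange(T), dim)  # shared temporal position code

def tokens(xy):
    """One token per frame: coordinates padded to `dim`, plus the position code."""
    feats = np.zeros((T, dim))
    feats[:, :2] = xy
    return feats + pe

x = np.vstack([tokens(hand_xy), tokens(obj_xy)])  # (2T, dim) hand + object tokens
out = attention(x, x, x)                          # every token attends to all others
print(out.shape)                                  # (16, 16)
```

In the actual network, the tokens would carry learned appearance features and the attention would sit inside a full Transformer encoder; this sketch only shows how temporal position codes let attention distinguish when along a trajectory each entity token occurred.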
