论文信息 - iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

Recent years have witnessed rapid progress in detecting and recognizing individual object instances. To understand the situation in a scene, however, computers need to recognize how humans interact with surrounding objects. In this paper, we tackle the challenging task of detecting human-object interactions (HOI). Our core idea is that the appearance of a person or an object instance contains informative cues on which relevant parts of an image to attend to for facilitating interaction prediction. To exploit these cues, we propose an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance. Such an attention-based network allows us to selectively aggregate features relevant for recognizing HOIs. We validate the efficacy of the proposed network on the Verb in COCO and HICO-DET datasets and show that our approach compares favorably with the state-of-the-arts.

[1] Yejin Choi,et al. Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2] Xiaogang Wang,et al. Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[4] Jiaxuan Wang,et al. HICO: A Benchmark for Recognizing Human-Object Interactions in Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5] Cordelia Schmid,et al. P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6] Trevor Darrell,et al. Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Jitendra Malik,et al. Visual Semantic Role Labeling , 2015, ArXiv.

[8] Derek Hoiem,et al. Diagnosing Error in Object Detectors , 2012, ECCV.

[9] Andrea Vedaldi,et al. Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[11] Iasonas Kokkinos,et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Deva Ramanan,et al. Attentional Pooling for Action Recognition , 2017, NIPS.

[13] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14] Shih-Fu Chang,et al. PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Ian D. Reid,et al. Towards Context-Aware Interaction Recognition for Visual Relationship Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17] Ivan Laptev,et al. Weakly-Supervised Learning of Visual Relations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18] Michael S. Bernstein,et al. Image retrieval using scene graphs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Yi Li,et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[20] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Christoph H. Lampert,et al. Detecting Visual Relationships Using Box Attention , 2018, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[22] Luc Van Gool,et al. Does Human Action Recognition Benefit from Pose Estimation? , 2011, BMVC.

[23] Xiaogang Wang,et al. ViP-CNN: Visual Phrase Guided Convolutional Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Svetlana Lazebnik,et al. Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues , 2016, ArXiv.

[25] Philip H. S. Torr,et al. Learn To Pay Attention , 2018, ICLR.

[26] Fei-Fei Li,et al. Combining randomization and discrimination for fine-grained image categorization , 2011, CVPR 2011.

[27] Yaser Sheikh,et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Li Fei-Fei,et al. Scaling Human-Object Interaction Recognition Through Zero-Shot Learning , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[30] Danfei Xu,et al. Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Svetlana Lazebnik,et al. Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering , 2016, ECCV.

[32] Jitendra Malik,et al. Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[34] Michael S. Bernstein,et al. Visual Relationship Detection with Language Priors , 2016, ECCV.

[35] Jia Deng,et al. Learning to Detect Human-Object Interactions , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[36] Subhransu Maji,et al. Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[37] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[38] Bo Dai,et al. Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Kaiming He,et al. Detecting and Recognizing Human-Object Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Larry S. Davis,et al. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.