Attention-Based Context Aware Reasoning for Situation Recognition

Situation Recognition (SR) is a fine-grained action recognition task in which the model is expected to predict not only the salient action in an image, but also the values of all semantic roles associated with that action. Predicting semantic roles is very challenging because a vast variety of values can fill any given role. Existing work has focused on dependency modelling architectures to address this issue. Inspired by the success of query-based visual reasoning (e.g., Visual Question Answering), we propose to treat semantic role prediction as a query-based visual reasoning problem. However, existing query-based reasoning methods do not handle inter-dependent queries, which is a unique requirement of semantic role prediction in SR. To the best of our knowledge, we propose the first set of methods to address inter-dependent queries in query-based visual reasoning. Extensive experiments demonstrate the effectiveness of our proposed methods, which achieve outstanding performance on the Situation Recognition task. Furthermore, by leveraging query inter-dependency, our methods improve upon a state-of-the-art method that answers queries separately. Our code is available at https://github.com/thilinicooray/context-aware-reasoning-for-sr
