Dual self-attention with co-attention networks for visual question answering

Abstract Visual Question Answering (VQA), an important task at the intersection of vision and language understanding, has attracted wide interest. Previous VQA methods generally use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to extract visual and textual features respectively, and then explore the correlation between the two feature sets to infer the answer. However, CNNs mainly focus on extracting local spatial information, while RNNs concentrate on exploiting sequential structure and long-range dependencies; it is therefore difficult for them to integrate local features with their global dependencies and learn effective representations of the image and question. To address this problem, we propose a novel model, Dual Self-Attention with Co-Attention networks (DSACA), for VQA. It models the internal dependencies of the spatial and the sequential structure separately using a self-attention mechanism. Specifically, DSACA contains three submodules. The visual self-attention module selectively aggregates the visual feature at each region as a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the words of the sentence. The visual-textual co-attention module then explores the close correlation between the visual and textual features learned by the two self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods.
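The core operation the abstract describes, aggregating the feature at each position as a weighted sum of the features at all positions, is standard scaled dot-product self-attention. The sketch below is a minimal NumPy illustration of that mechanism, not the authors' DSACA implementation; the projection matrices `Wq`, `Wk`, `Wv` and all dimensions are hypothetical.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a set of feature vectors.

    X:          (n, d) features at n positions (image regions or question words)
    Wq, Wk, Wv: (d, d) query/key/value projections (hypothetical parameters)

    Each output row is a softmax-weighted sum of the value vectors at ALL
    positions, so every region/word can attend to every other one -- the
    global-dependency modeling the abstract refers to.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[1])           # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over positions
    return weights @ V                               # weighted sum over all positions

rng = np.random.default_rng(0)
n, d = 5, 8                                          # e.g. 5 regions, 8-dim features
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): same shape, each row now context-aware
```

In DSACA, one such module would operate over image-region features and another over word features, before the co-attention module relates the two resulting representations.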
