Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering

Visual question answering (VQA) is receiving increasing attention from researchers in both the computer vision and natural language processing communities. The VQA task has two key components: feature extraction and multi-modal fusion. For feature extraction, we introduce a novel co-attention scheme that combines Sentence-Guided Word Attention (SWA) and Question-Guided Image Attention in a unified framework. Specifically, the textual attention SWA uses the semantics of the whole question sentence to weight the contribution of each question word to the text representation. For multi-modal fusion, we propose a Cross-Modal Multistep Fusion (CMF) network that generates multistep features and lets the two modalities interact repeatedly, rather than modeling a single complex interaction between them as most current fusion methods do. To keep the computational cost from growing linearly with the number of steps, the CMF shares its parameters across all steps. Extensive experiments demonstrate that the proposed method achieves performance competitive with or better than the state of the art.
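The following is a minimal PyTorch sketch of the two ideas in the abstract: a sentence-guided word attention that pools question words by their relevance to the whole sentence, and a multistep fusion cell whose parameters are shared across steps so extra steps add no parameters. All names (SentenceGuidedWordAttention, CMFCell, the hidden size, the number of steps, the elementwise per-step fusion) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of SWA + CMF-style multistep fusion; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SentenceGuidedWordAttention(nn.Module):
    """SWA sketch: score each word embedding against a whole-sentence
    representation and pool the words with the resulting weights."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, 1)

    def forward(self, words, sentence):
        # words: (batch, n_words, dim); sentence: (batch, dim)
        s = sentence.unsqueeze(1).expand_as(words)          # broadcast sentence
        scores = self.proj(torch.cat([words, s], dim=-1))   # (batch, n_words, 1)
        alpha = F.softmax(scores, dim=1)                    # per-word weights
        return (alpha * words).sum(dim=1)                   # (batch, dim)


class CMFCell(nn.Module):
    """One fusion step: each modality is updated conditioned on the other.
    The same cell (hence the same parameters) is applied at every step,
    so the parameter count does not grow with the number of steps."""

    def __init__(self, dim):
        super().__init__()
        self.q_from_v = nn.Linear(2 * dim, dim)
        self.v_from_q = nn.Linear(2 * dim, dim)

    def forward(self, q, v):
        q_next = torch.tanh(self.q_from_v(torch.cat([q, v], dim=-1)))
        v_next = torch.tanh(self.v_from_q(torch.cat([v, q], dim=-1)))
        return q_next, v_next


def multistep_fusion(q, v, cell, steps=3):
    """Run the shared cell several times and keep each step's fused feature,
    mirroring the 'multistep features' idea described in the abstract."""
    fused = []
    for _ in range(steps):            # parameters shared across iterations
        q, v = cell(q, v)
        fused.append(q * v)           # simple elementwise fusion per step
    return torch.stack(fused, dim=1)  # (batch, steps, dim)


if __name__ == "__main__":
    batch, n_words, dim = 2, 8, 512
    words = torch.randn(batch, n_words, dim)   # stand-in word features
    sentence = words.mean(dim=1)               # stand-in sentence embedding
    image = torch.randn(batch, dim)            # stand-in attended image feature

    swa = SentenceGuidedWordAttention(dim)
    question = swa(words, sentence)

    feats = multistep_fusion(question, image, CMFCell(dim), steps=3)
    print(feats.shape)  # torch.Size([2, 3, 512])
```

In this sketch the per-step fused features are stacked rather than discarded, so a downstream answer classifier could consume all interaction steps; how the real CMF combines them is not specified by the abstract.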
