Paper Information - AnswerNet: Learning to Answer Questions

AnswerNet: Learning to Answer Questions

Multi-modal tasks such as visual question answering (VQA) are an important step towards human-level artificial intelligence. In general, the input to the VQA task consists of an image and a related question. To answer the question correctly, a model needs to extract and integrate useful information from both the image and the question. In this paper, we propose a model named AnswerNet to tackle this task. In the proposed model, discriminative features are extracted from both the image and the question. Specifically, high-level image features are extracted by a state-of-the-art convolutional neural network, the Deep Residual Net. For question features, the semantic representation of the question and the term frequencies of its distinct words are captured by a long short-term memory (LSTM) network and a bag-of-words model, respectively. A hierarchical fusion network is then proposed to fuse the image features with the question features effectively. Experimental results on three large-scale datasets, VQA, COCO-QA, and VQA2, demonstrate the effectiveness of the proposed AnswerNet.
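As a rough illustration of the bag-of-words part of the question encoder, the sketch below counts term frequencies of distinct words over a fixed vocabulary. It is not the authors' implementation: the helper names (`tokenize`, `build_vocab`, `bow_vector`), the tokenization rule, and the sample questions are all assumptions made for this example, and the image branch, the LSTM, and the fusion network are not shown.

```python
from collections import Counter


def tokenize(question):
    # Assumed tokenizer: split on whitespace, trim edge punctuation, lowercase.
    tokens = (w.strip("?.,!").lower() for w in question.split())
    return [t for t in tokens if t]


def build_vocab(questions):
    # Map each distinct word to an index; sorting keeps the mapping deterministic.
    words = {w for q in questions for w in tokenize(q)}
    return {w: i for i, w in enumerate(sorted(words))}


def bow_vector(question, vocab):
    # Term-frequency vector: entry i holds the count of vocabulary word i.
    counts = Counter(tokenize(question))
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] = n
    return vec


# Hypothetical sample questions, used only to build a tiny vocabulary.
questions = ["What color is the cat?", "Is the cat on the mat?"]
vocab = build_vocab(questions)
print(bow_vector("Is the cat on the mat?", vocab))
```

In a full model such a vector would typically be combined with the LSTM output before fusion with the image features; how AnswerNet does this is specified in the paper, not here.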

Haibo He | Zhiqiang Wan
