论文信息 - Segmentation Guided Attention Networks for Visual Question Answering

Segmentation Guided Attention Networks for Visual Question Answering

In this paper we propose to solve the problem of Visual Question Answering by using a novel segmentation guided attention based network which we call SegAttendNet. We use image segmentation maps, generated by a Fully Convolutional Deep Neural Network to refine our attention maps and use these refined attention maps to make the model focus on the relevant parts of the image to answer a question. The refined attention maps are used by the LSTM network to learn to produce the answer. We presently train our model on the visual7W dataset and do a category wise evaluation of the 7 question categories. We achieve state of the art results on this dataset and beat the previous benchmark on this dataset by a 1.5% margin improving the question answering accuracy from 54.1% to 55.6% and demonstrate improvements in each of the question categories. We also visualize our generated attention maps and note their improvement over the attention maps generated by the previous best approach.

Vasu Sharma | Ankita Bishnu | Labhesh Patel

[1] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2] Sanja Fidler,et al. The Role of Context for Object Detection and Semantic Segmentation in the Wild , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[4] Richard Socher,et al. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing , 2015, ICML.

[5] Richard S. Zemel,et al. Exploring Models and Data for Image Question Answering , 2015, NIPS.

[6] Michael S. Bernstein,et al. Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Jonathan Berant,et al. Semantic Parsing via Paraphrasing , 2014, ACL.

[8] Saurabh Singh,et al. Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Wei Xu,et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question , 2015, NIPS.

[10] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[11] Jason Weston,et al. Memory Networks , 2014, ICLR.

[12] Jason Weston,et al. Question Answering with Subgraph Embeddings , 2014, EMNLP.

[13] Mario Fritz,et al. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[15] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Shuicheng Yan,et al. A Focused Dynamic Attention Model for Visual Question Answering , 2016, ArXiv.

[17] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).