Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look when answering questions about images. We design and test multiple novel, game-inspired attention-annotation interfaces that require subjects to sharpen regions of a blurred image in order to answer a question, and use them to introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not appear to be looking at the same regions as humans.
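
Below is a minimal sketch of the quantitative comparison described above: Spearman rank-order correlation between a model's attention map and a human attention map. It assumes both maps have already been resampled to a common spatial resolution; the function and variable names are illustrative, not taken from the paper's released code.

```python
import numpy as np
from scipy.stats import spearmanr


def rank_correlation(model_attention, human_attention):
    """Spearman rank-order correlation between two attention maps.

    Both inputs are assumed to be 2-D arrays at the same spatial
    resolution (e.g., 14x14). Higher values mean the two maps rank
    image regions more similarly.
    """
    rho, _ = spearmanr(model_attention.ravel(), human_attention.ravel())
    return rho


# Toy example with random maps; real inputs would be a VQA model's
# attention map and a VQA-HAT human attention map for the same
# question-image pair.
rng = np.random.default_rng(0)
model_map = rng.random((14, 14))
human_map = rng.random((14, 14))
print(f"rank-order correlation: {rank_correlation(model_map, human_map):.3f}")
```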
