VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering

We present VQA-MHUG – a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modular Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show – for the first time – that for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points to the potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, including, but potentially also beyond, VQA.
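A minimal sketch of the kind of comparison the abstract describes (not the authors' released code): correlating a model's attention weights over question tokens with a human gaze distribution over the same tokens. The function and variable names, and the use of Spearman rank correlation, are illustrative assumptions.

```python
# Illustrative sketch: compare human gaze vs. model attention over question tokens.
# Names, data, and the choice of Spearman correlation are assumptions for this example.
import numpy as np
from scipy.stats import spearmanr

def attention_correlation(human_gaze: np.ndarray, model_attention: np.ndarray) -> float:
    """Rank correlation between two attention distributions over the same tokens.

    Both inputs are 1-D arrays of non-negative weights, one value per question token.
    They are normalized to sum to 1 so that only the relative allocation matters.
    """
    human = human_gaze / human_gaze.sum()
    model = model_attention / model_attention.sum()
    rho, _ = spearmanr(human, model)
    return rho

# Toy example: normalized fixation durations per token vs. a model's attention weights.
human_gaze = np.array([0.10, 0.45, 0.30, 0.15])       # e.g. fixation time per token
model_attention = np.array([0.05, 0.50, 0.25, 0.20])  # e.g. softmax attention weights
print(f"Spearman rho = {attention_correlation(human_gaze, model_attention):.3f}")
```

Per-question scores of this form could then enter a regression against answer accuracy, which is how a correlation measure can act as a predictor of VQA performance.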
