Examine before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA

Multiple-choice (MC) Visual Question Answering (VQA) is related to, yet fundamentally different from, open-ended VQA, because the answer options are provided. Most existing works tackle both tasks in a unified pipeline, solving a multi-class classification problem to infer the best answer from a predefined answer set; for MC VQA, the option matching the predicted answer is then selected. However, this contradicts how humans reason. Normally, people examine the question, the answer options, and the reference image before answering an MC VQA question: they either rely on the question and answer options alone to deduce the correct answer when the question is not image-related, or read the question and answer options and then purposefully search the reference image for the answer. We therefore propose a novel approach, Multi-task Learning with Adaptive-attention (MTA), that simulates this human reasoning process for MC VQA. Specifically, we first fuse the answer-option and question features, and then adaptively attend to the visual features to infer the answer. Furthermore, we design the model as a multi-task learning architecture that integrates the open-ended VQA task to further boost MC VQA performance. We evaluate our approach on two standard benchmark datasets, VQA and Visual7W, and it sets new records on both for the MC VQA task, reaching 73.5% and 65.9% average accuracy, respectively.
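
The abstract describes the MTA pipeline only at a high level. Below is a minimal PyTorch sketch of how such a pipeline could be wired up: text-side fusion of question and option features first, then gated attention over image regions, with two task heads sharing the joint representation. All module names, layer sizes, and the exact fusion, attention, and gating operators (MTASketch, fuse, att, gate, mc_head, oe_head) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MTASketch(nn.Module):
    """Minimal sketch of the Multi-task Learning with Adaptive-attention (MTA)
    idea from the abstract. Layer sizes and operators are assumptions."""

    def __init__(self, dim=512, vocab=3000):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)       # fuse question + option features
        self.att = nn.Linear(2 * dim, 1)          # score each image region against the text
        self.gate = nn.Linear(dim, 1)             # adaptive gate: how much to rely on vision
        self.mc_head = nn.Linear(2 * dim, 1)      # multiple-choice score for one option
        self.oe_head = nn.Linear(2 * dim, vocab)  # open-ended classifier (multi-task branch)

    def forward(self, q, opt, v):
        # q: (B, dim) question feature; opt: (B, dim) one answer-option feature
        # v: (B, R, dim) region-level visual features
        t = torch.tanh(self.fuse(torch.cat([q, opt], dim=-1)))        # text fused first
        scores = self.att(torch.cat([v, t.unsqueeze(1).expand_as(v)], dim=-1))
        alpha = torch.softmax(scores, dim=1)                          # attention over regions
        v_att = (alpha * v).sum(dim=1)                                # attended visual feature
        g = torch.sigmoid(self.gate(t))                               # near 0: answer from text alone
        joint = torch.cat([t, g * v_att], dim=-1)
        return self.mc_head(joint), self.oe_head(joint)               # MC score + open-ended logits
```

In this sketch the scalar gate is what makes the attention "adaptive" in the sense the abstract describes: for questions answerable from the text alone, the gate can suppress the visual pathway, while image-related questions pass the attended region features through. Training would score each answer option with mc_head and jointly minimize the MC and open-ended losses.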
