WeaQA: Weak Supervision via Captions for Visual Question Answering

Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This reliance on annotated datasets limits generalization to new types of questions and scenes, and linguistic priors, along with biases and errors arising from annotator subjectivity, have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, using only images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we show that spatial-pyramid image patches are a simple yet effective alternative to the dense, costly object bounding-box annotations used by existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
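The two ingredients above can be illustrated with minimal sketches. Both are rough approximations under our own assumptions: the templates, grid levels, backbone choice, and function names here are hypothetical and do not reflect the exact WeaQA implementation.

First, procedural Q-A generation from captions can be approximated with cloze-style templates over noun phrases, replacing each noun phrase with a question word and treating its head noun as the answer:

```python
# Minimal sketch: procedural Q-A pairs from a caption via cloze templates
# over noun phrases. Real pipelines use richer rules (semantic roles,
# paraphrasing, answer filtering); this only illustrates the idea.
import spacy

nlp = spacy.load("en_core_web_sm")

def qa_from_caption(caption: str):
    doc = nlp(caption)
    pairs = []
    for chunk in doc.noun_chunks:
        # Swap the noun phrase for "what" to form a cloze-style question;
        # the head noun of the phrase serves as the answer.
        question = caption.replace(chunk.text, "what", 1).rstrip(".") + "?"
        pairs.append((question.capitalize(), chunk.root.text.lower()))
    return pairs

print(qa_from_caption("A man is riding a horse on the beach."))
# e.g. [('What is riding a horse on the beach?', 'man'),
#       ('A man is riding what on the beach?', 'horse'), ...]
```

Second, spatial-pyramid patches replace detector-based region proposals by tiling the image at several grid resolutions and encoding each cell independently, so no object detector or box annotations are needed:

```python
# Minimal sketch: spatial-pyramid patches as a substitute for object
# bounding boxes. Grid levels (1x1, 2x2, 4x4) and the ResNet-50 backbone
# are illustrative choices, not the paper's exact configuration.
from PIL import Image
import torch
import torchvision.models as models
import torchvision.transforms as T

def spatial_pyramid_patches(image: Image.Image, levels=(1, 2, 4)):
    """Yield crops covering the image at each pyramid level
    (edge pixels at non-divisible sizes are dropped for simplicity)."""
    w, h = image.size
    for n in levels:
        cw, ch = w // n, h // n
        for i in range(n):
            for j in range(n):
                yield image.crop((j * cw, i * ch, (j + 1) * cw, (i + 1) * ch))

# Encode every patch with a frozen CNN backbone to obtain a bag of
# region features analogous to detector-based region features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classification head
backbone.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def encode_patches(image: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        batch = torch.stack([preprocess(p) for p in spatial_pyramid_patches(image)])
        return backbone(batch)  # (1 + 4 + 16) x 2048 feature matrix
```

The resulting per-patch feature matrix can be fed to a standard VQA model in place of detector-derived region features.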
