IQ-VQA: Intelligent Visual Question Answering

Even though there has been tremendous progress in the field of Visual Question Answering, models today still tend to be inconsistent and brittle. To this end, we propose a model-independent cyclic framework that increases the consistency and robustness of any VQA architecture. We train our models to answer the original question, generate an implication based on that answer, and then also answer the generated implication correctly. As part of the cyclic framework, we propose a novel implication generator that can generate implied questions from any question-answer pair. As a baseline for future work on consistency, we provide a new human-annotated VQA-Implications dataset. The dataset consists of ~30k questions containing implications of three types (Logical Equivalence, Necessary Condition, and Mutual Exclusion) derived from the VQA v2.0 validation set. We show that our framework improves the consistency of VQA models by ~15% on a rule-based implication dataset and by ~7% on the VQA-Implications dataset, and improves robustness by ~2%, without degrading their performance. In addition, we quantitatively show improvements in attention maps, which highlight a better multi-modal understanding of vision and language.
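To illustrate the three implication types: given the pair "What color is the bus?" / "red", a Logical Equivalence implication would be "Is the bus red?" (yes), a Necessary Condition would be "Is there a bus in the image?" (yes), and a Mutual Exclusion would be "Is the bus blue?" (no).

The cyclic training objective can be summarized as a three-step loop. The sketch below shows one way this could be wired up; the `vqa_model` and `implication_generator` interfaces, their return values, and the loss weights `lambda_gen` and `lambda_cons` are hypothetical placeholders, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def cyclic_training_step(vqa_model, implication_generator,
                         image, question, answer_label,
                         lambda_gen=1.0, lambda_cons=0.5):
    """One step of the cyclic framework (hypothetical interfaces)."""
    # Step 1: answer the original question with the base VQA model.
    answer_logits = vqa_model(image, question)          # (batch, num_answers)
    loss_vqa = F.cross_entropy(answer_logits, answer_label)

    # Step 2: generate an implied question conditioned on the QA pair.
    # The generator is assumed to return the implied question, its
    # ground-truth answer label (e.g. yes/no), and its own generation loss.
    pred_answer = answer_logits.argmax(dim=-1)
    impl_question, impl_label, loss_gen = implication_generator(
        image, question, pred_answer)

    # Step 3: the same VQA model must also answer the implication correctly,
    # which is what enforces consistency.
    impl_logits = vqa_model(image, impl_question)
    loss_cons = F.cross_entropy(impl_logits, impl_label)

    # Joint objective over all three steps.
    return loss_vqa + lambda_gen * loss_gen + lambda_cons * loss_cons
```

Because the framework only composes losses around the base model's forward pass, it stays model-independent: any VQA architecture exposing an `(image, question) -> logits` interface can be plugged in unchanged.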
