Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Visual question answering (VQA) is a hallmark of vision and language reasoning, and it remains challenging in the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) to the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we use natural language and network interpretation as intermediate representations that glue pretrained models together: network interpretation first highlights the image regions relevant to the question, we then generate informative captions grounded in those regions, and finally we pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves a 9.1% improvement on GQA over FewVLM, which uses 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa.
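
To make the caption-then-answer idea concrete, the following is a minimal sketch built from off-the-shelf, frozen Hugging Face checkpoints: a BLIP captioner produces several diverse captions, and a UnifiedQA (T5) reader answers the question from those captions as text context. The checkpoint names, sampling settings, and number of captions here are illustrative assumptions, not the paper's configuration, and the sketch omits the full method's question-guided patch selection via network interpretation (Grad-CAM) and its Fusion-in-Decoder reading of many captions; the official implementation is in the LAVIS repository linked above.

```python
# Minimal caption-then-answer sketch of the plug-and-play idea (not the official PNP-VQA code).
# Assumes: pip install torch transformers sentencepiece pillow requests
import requests
from PIL import Image
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BlipForConditionalGeneration,
    BlipProcessor,
)

# 1) Frozen image captioner (BLIP): turns the image into natural-language descriptions.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# 2) Frozen question-answering PLM (UnifiedQA, a T5 checkpoint): never sees the image.
qa_tokenizer = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-large")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-large")


def answer(image: Image.Image, question: str, num_captions: int = 5) -> str:
    # Sample diverse captions with nucleus sampling; the paper samples far more captions,
    # and only from question-relevant image patches, which this sketch skips.
    pixel_inputs = cap_processor(images=image, return_tensors="pt")
    caption_ids = cap_model.generate(
        **pixel_inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=30,
        num_return_sequences=num_captions,
    )
    captions = cap_processor.batch_decode(caption_ids, skip_special_tokens=True)

    # Concatenate the captions into a textual context and query the reader using
    # UnifiedQA's lowercase "question \n context" input convention.
    context = ". ".join(c.strip() for c in captions)
    prompt = f"{question.lower()} \\n {context.lower()}"
    qa_inputs = qa_tokenizer(prompt, return_tensors="pt", truncation=True)
    answer_ids = qa_model.generate(**qa_inputs, max_new_tokens=10)
    return qa_tokenizer.decode(answer_ids[0], skip_special_tokens=True).strip()


if __name__ == "__main__":
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
    img = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    print(answer(img, "how many cats are on the couch?"))
```

Because both models stay frozen, swapping in a larger captioner or reader changes only the checkpoint names; this modularity is what allows the framework to scale with better pretrained components without any joint training.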

[1] S. Savarese, et al. LAVIS: A Library for Language-Vision Intelligence, 2022, arXiv.

[2] D. Schuurmans, et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, 2022, ICLR.

[3] Radu Soricut, et al. All You May Need for VQA are Image Captions, 2022, NAACL.

[4] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[5] Stella Rose Biderman, et al. GPT-NeoX-20B: An Open-Source Autoregressive Language Model, 2022, BigScience Workshop.

[6] Adrian S. Wong, et al. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, 2022, ICLR.

[7] Qun Liu, et al. Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, 2022, Findings of ACL.

[8] Hannaneh Hajishirzi, et al. UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training, 2022, arXiv.

[9] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.

[10] Swarat Chaudhuri, et al. Natural Language Deduction through Search over Statement Compositions, 2022, EMNLP.

[11] Mohit Bansal, et al. VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, 2022, CVPR.

[12] A. Frank, et al. MAGMA - Multimodal Augmentation of Generative Models through Adapter-based Finetuning, 2021, EMNLP.

[13] Hang Li, et al. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, 2021, ICML.

[14] Weizhu Chen, et al. A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models, 2021, ACL.

[15] Alexander M. Rush, et al. Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, ICLR.

[16] Carrie J. Cai, et al. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts, 2021, CHI.

[17] Zhe Gan, et al. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, 2021, AAAI.

[18] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.

[19] Adams Wei Yu, et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, ICLR.

[20] Mohamed Elhoseiny, et al. VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning, 2022, CVPR.

[21] Wai Keen Vong, et al. Few-shot image classification by generating natural language rules, 2022.

[22] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, arXiv.

[23] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.

[24] Oriol Vinyals, et al. Multimodal Few-Shot Learning with Frozen Language Models, 2021, NeurIPS.

[25] Kenneth Ward Church, et al. On Attention Redundancy: A Comprehensive Study, 2021, NAACL.

[26] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[27] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.

[28] Jaemin Cho, et al. Unifying Vision-and-Language Tasks via Text Generation, 2021, ICML.

[29] Hua Wu, et al. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, 2020, ACL.

[30] Tejas Gokhale, et al. WeaQA: Weak Supervision via Captions for Visual Question Answering, 2020, Findings of ACL.

[31] Edouard Grave, et al. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering, 2020, EACL.

[32] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[33] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[34] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.

[35] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.

[36] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[37] Joelle Pineau, et al. Language GANs Falling Short, 2018, ICLR.

[38] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.

[39] Cho-Jui Hsieh, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language, 2019, arXiv.

[40] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.

[41] Ali Farhadi, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, 2019, CVPR.

[42] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, CVPR.

[43] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[44] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2016, International Journal of Computer Vision.

[45] Yann Dauphin, et al. Hierarchical Neural Story Generation, 2018, ACL.

[46] Dan Klein, et al. Learning with Latent Language, 2017, NAACL.

[47] Yoav Goldberg, et al. Controlling Linguistic Style Aspects in Neural Language Generation, 2017, arXiv.

[48] Abhishek Das, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, 2017, ICCV.

[49] Ashwin K. Vijayakumar, et al. Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models, 2016, arXiv.

[50] B. T. Thomas Yeo, et al. The modular and integrative functional architecture of the human brain, 2015, Proceedings of the National Academy of Sciences.

[51] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[52] S. Shettleworth. Modularity, comparative cognition and human uniqueness, 2012, Philosophical Transactions of the Royal Society B: Biological Sciences.

[53] J. Fodor. The Modularity of Mind: An Essay on Faculty Psychology, 1986.