MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/oscmansan/mapl.
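To make the idea of a "lightweight mapping between the representation spaces of unimodal models" concrete, below is a minimal sketch of such a mapper: a small network with learnable query tokens that projects features from a frozen image encoder into the input embedding space of a frozen language model, so the mapped tokens can be prepended to a text prompt. All names, dimensions, and hyperparameters here are illustrative assumptions, not the released MAPL implementation (see the linked repository for the authors' code).

```python
# Illustrative sketch only: a lightweight mapping network between a frozen
# image encoder and a frozen language model. Dimensions are assumptions.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps visual features (d_vision) to a fixed number of LM prefix embeddings (d_lm)."""
    def __init__(self, d_vision=768, d_lm=4096, num_prefix_tokens=32, hidden=256, num_layers=4):
        super().__init__()
        # Learnable query tokens that become the visual prefix after attending to image features.
        self.queries = nn.Parameter(torch.randn(num_prefix_tokens, hidden))
        self.proj_in = nn.Linear(d_vision, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(hidden, d_lm)

    def forward(self, image_features):            # image_features: [B, N, d_vision]
        b = image_features.size(0)
        x = self.proj_in(image_features)           # [B, N, hidden]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Encode queries jointly with the projected image features,
        # then keep only the query positions as the visual prefix.
        out = self.encoder(torch.cat([q, x], dim=1))
        prefix = out[:, : self.queries.size(0)]    # [B, num_prefix_tokens, hidden]
        return self.proj_out(prefix)               # [B, num_prefix_tokens, d_lm]
```

In this kind of setup, only the mapping network is trained on aligned image-text pairs; the image encoder and language model stay frozen, which is what keeps the number of trainable parameters small.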
