Grounding Language Models to Images for Multimodal Inputs and Outputs

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data and generate text interleaved with retrieved images. Our method leverages the abilities of language models learned from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
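The core recipe described above (a frozen pretrained language model plus small trainable linear layers on its input and output sides) can be sketched roughly as follows. This is a minimal illustration, not the authors' released implementation: the class and layer names (`GroundedLM`, `input_proj`, `output_proj`, `image_proj`), the checkpoint choices (`facebook/opt-1.3b`, `openai/clip-vit-base-patch32`), and the retrieval dimension are assumptions made only for the example.

```python
# Minimal sketch of the grounding idea: frozen LM + frozen image encoder,
# with only a few linear projections trained to bridge the two modalities.
# (Assumed components: a CLIP-style vision encoder and an OPT-style causal LM
# from HuggingFace Transformers; all names here are illustrative.)
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel


class GroundedLM(nn.Module):
    def __init__(self,
                 lm_name: str = "facebook/opt-1.3b",
                 vit_name: str = "openai/clip-vit-base-patch32",
                 retrieval_dim: int = 256):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        self.visual_encoder = CLIPVisionModel.from_pretrained(vit_name)

        # Both pretrained backbones stay frozen; only the linear layers below train.
        for p in self.lm.parameters():
            p.requires_grad = False
        for p in self.visual_encoder.parameters():
            p.requires_grad = False

        lm_dim = self.lm.config.hidden_size
        vis_dim = self.visual_encoder.config.hidden_size

        # Input side: map image features into the LM's token-embedding space,
        # so images can be interleaved with text tokens.
        self.input_proj = nn.Linear(vis_dim, lm_dim)
        # Output side: map LM hidden states and image features into a shared
        # retrieval space, so generated text can be used to retrieve images.
        self.output_proj = nn.Linear(lm_dim, retrieval_dim)
        self.image_proj = nn.Linear(vis_dim, retrieval_dim)

    def embed_image_as_tokens(self, pixel_values: torch.Tensor) -> torch.Tensor:
        """Encode an image batch into 'visual tokens' for the frozen LM."""
        feats = self.visual_encoder(pixel_values).pooler_output      # (B, vis_dim)
        return self.input_proj(feats).unsqueeze(1)                   # (B, 1, lm_dim)

    def forward(self, inputs_embeds: torch.Tensor, attention_mask: torch.Tensor):
        """`inputs_embeds` is text token embeddings interleaved with visual tokens."""
        out = self.lm(inputs_embeds=inputs_embeds,
                      attention_mask=attention_mask,
                      output_hidden_states=True)
        # A retrieval query taken from the final hidden state of the sequence;
        # it is scored against image_proj(image features) with a contrastive loss.
        retrieval_query = self.output_proj(out.hidden_states[-1][:, -1])
        return out.logits, retrieval_query
```

In this sketch, training would combine a standard next-token loss on the text with a contrastive loss between `retrieval_query` and `image_proj`-embedded candidate images; at inference time, images are produced by retrieval against a fixed image set rather than by pixel generation.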
