Multimodal Knowledge Alignment with Reinforcement Learning

Large language models readily adapt to novel settings, even without task-specific training data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we propose ESPER, which extends language-only zero-shot models to unseen multimodal tasks such as image and audio captioning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: in the image case, for example, our reward optimization relies only on cosine similarity derived from CLIP, and thus requires no additional explicitly paired (image, caption) data. Because the parameters of the language model are left unchanged, the model maintains its capacity for zero-shot generalization. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating several diversely styled captions for each image.
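The reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the random stand-in vectors are hypothetical placeholders for the outputs of CLIP's actual image and text encoders, and the scalar it returns is the kind of reward a policy-gradient method would maximize.

```python
import numpy as np

def clip_style_reward(image_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a caption embedding.

    In ESPER-style training, this scalar (computed with real CLIP encoders)
    serves as the reward signal; no paired (image, caption) data is needed.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return float(image_emb @ caption_emb)

# Toy usage with random stand-in embeddings (hypothetical; real use would
# call something like clip_model.encode_image / clip_model.encode_text).
rng = np.random.default_rng(0)
img = rng.normal(size=512)
cap = rng.normal(size=512)
reward = clip_style_reward(img, cap)
```

Because the reward is differentiable-free from the generator's perspective (CLIP is only queried for a score), it can drive reinforcement learning updates on the lightweight adapter while the language model's own parameters stay frozen.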
