Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts

We present a sequence-to-sequence vision-language model whose parameters are jointly trained on all tasks (all for one) and fully shared among multiple tasks (one for all), resulting in a single model, which we name Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel mechanism called the Task Explanation Prompt (TEP). The TEP reduces interference among tasks, allowing the model to focus on their shared structure. With a single shared model, Musketeer achieves results comparable to or better than strong baselines trained on individual tasks, almost uniformly across multiple tasks.
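As a rough illustration of how such a prompt could condition a single shared sequence-to-sequence model, the sketch below prepends a structured task explanation to the per-instance text input. The field names, wording, and helper function are hypothetical placeholders chosen for illustration, not the paper's actual TEP format.

```python
# A minimal sketch (assumed structure, not the paper's exact TEP format) of
# prepending a structured task explanation to the text input of a shared
# sequence-to-sequence vision-language model. All field names below are
# hypothetical placeholders.

def build_tep_prompt(task_name: str, description: str, output_format: str,
                     instance_text: str) -> str:
    """Concatenate a structured task explanation with the per-instance input."""
    tep = (
        f"Task: {task_name}. "
        f"Description: {description}. "
        f"Output format: {output_format}. "
    )
    return tep + instance_text

# Example: a captioning-style query routed through the same shared model.
prompt = build_tep_prompt(
    task_name="image captioning",
    description="generate a natural-language sentence describing the image",
    output_format="free-form text",
    instance_text="What does the image describe?",
)
print(prompt)
```

The same shared model would receive a different task explanation for, say, visual grounding or visual entailment, so the prompt itself carries the task-specific structure while the parameters stay fully shared.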
