Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts
Zhuowen Tu, Davide Modolo, Zhaowei Cai, Hao Yang, Yantao Shen, Stefano Soatto, Siqi Deng, Zhaoyang Zhang, Kunyu Shi, Jun Fang
[1] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ICML.
[2] Hongsheng Li, et al. Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks, 2022, CVPR 2023.
[3] Z. Tu, et al. On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning, 2023, ICLR.
[4] Aniruddha Kembhavi, et al. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, 2022, ICLR.
[5] David J. Fleet, et al. A Unified Sequence Interface for Vision Tasks, 2022, NeurIPS.
[6] Ming Zhou, et al. ProQA: Structural Prompt-based Pre-training for Unified Question Answering, 2022, NAACL.
[7] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[8] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[9] Jingren Zhou, et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, 2022, ICML.
[10] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[11] Marcus Rohrbach, et al. FLAVA: A Foundational Language And Vision Alignment Model, 2021, CVPR 2022.
[12] Xizhou Zhu, et al. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, 2021, CVPR 2022.
[13] Faisal Ahmed, et al. UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling, 2021, ECCV.
[14] Sanket Vaibhav Mehta, et al. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning, 2021, arXiv.
[15] Lu Yuan, et al. Florence: A New Foundation Model for Computer Vision, 2021, arXiv.
[16] Zhe Gan, et al. UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, 2021, arXiv.
[17] Peng Gao, et al. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, 2021, arXiv.
[18] Peng Gao, et al. CLIP-Adapter: Better Vision-Language Models with Feature Adapters, 2021, IJCV.
[19] David J. Fleet, et al. Pix2seq: A Language Modeling Framework for Object Detection, 2021, ICLR.
[20] Jason Baldridge, et al. MURAL: Multimodal, Multitask Retrieval Across Languages, 2021, arXiv.
[21] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.
[22] Adams Wei Yu, et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, ICLR.
[23] Maosong Sun, et al. Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification, 2021, ACL.
[24] Olivier J. Hénaff, et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs, 2021, ICLR.
[25] Tao Qin, et al. R-Drop: Regularized Dropout for Neural Networks, 2021, NeurIPS.
[26] Oriol Vinyals, et al. Multimodal Few-Shot Learning with Frozen Language Models, 2021, NeurIPS.
[27] Quoc V. Le, et al. CoAtNet: Marrying Convolution and Attention for All Data Sizes, 2021, NeurIPS.
[28] Andrew Zisserman, et al. Perceiver: General Perception with Iterative Attention, 2021, ICML.
[29] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[30] Ronghang Hu, et al. UniT: Multimodal Multitask Learning with a Unified Transformer, 2021, ICCV.
[31] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.
[32] Jaemin Cho, et al. Unifying Vision-and-Language Tasks via Text Generation, 2021, ICML.
[33] Jean-Baptiste Alayrac, et al. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, 2021, TACL.
[34] Sonal Gupta, et al. Muppet: Massive Multi-task Representations with Pre-Finetuning, 2021, EMNLP.
[35] Stefano Soatto, et al. Structured Prediction as Translation between Augmented Natural Languages, 2021, ICLR.
[36] Hinrich Schütze, et al. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners, 2020, NAACL.
[37] Andrew Zisserman, et al. Self-Supervised MultiModal Versatile Networks, 2020, NeurIPS.
[38] Tie-Yan Liu, et al. Rethinking Positional Encoding in Language Pre-training, 2020, ICLR.
[39] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[40] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[41] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.
[42] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[43] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.
[44] Asim Kadav, et al. Visual Entailment: A Novel Task for Fine-Grained Image Understanding, 2019, arXiv.
[45] Richard Socher, et al. The Natural Language Decathlon: Multitask Learning as Question Answering, 2018, arXiv.
[46] Sebastian Ruder, et al. Universal Language Model Fine-tuning for Text Classification, 2018, ACL.
[47] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[48] Lukasz Kaiser, et al. One Model To Learn Them All, 2017, arXiv.
[49] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[50] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2016, IJCV.
[51] Philipp Koehn, et al. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
[52] Licheng Yu, et al. Modeling Context in Referring Expressions, 2016, ECCV.
[53] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR 2016.
[54] Alan L. Yuille, et al. Generation and Comprehension of Unambiguous Object Descriptions, 2015, CVPR 2016.
[55] Jason Weston, et al. A Neural Attention Model for Abstractive Sentence Summarization, 2015, EMNLP.
[56] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.
[57] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, arXiv.
[58] Alex Graves. Generating Sequences With Recurrent Neural Networks, 2013, arXiv.
[59] Geoffrey E. Hinton, et al. Generating Text with Recurrent Neural Networks, 2011, ICML.
[60] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[61] T. H. Cormen, et al. Introduction to Algorithms, 2009, MIT Press.
[62] Percy Liang, et al. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, ACL.
[63] Zhe Gan, et al. Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, 2021, arXiv.
[64] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[65] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[66] Lukáš Burget, et al. Recurrent neural network based language model, 2010, INTERSPEECH.