VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI): their impressive zero-shot capacity for user-tailored tasks endows them with immense potential across a wide range of applications. In computer vision, however, despite the availability of numerous powerful vision foundation models (VFMs), these models remain restricted to tasks in a pre-defined form and struggle to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. The framework provides a unified perspective on vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed through language instructions. An LLM-based decoder then makes the appropriate predictions for these open-ended tasks according to the given instructions. Extensive experiments show that VisionLLM supports different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, with good results in all cases. Notably, with this generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo will be released based on https://github.com/OpenGVLab/InternGPT, and the code will be released at https://github.com/OpenGVLab/VisionLLM.
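To make the instruction-driven interface described above concrete, the following is a minimal, dependency-free Python sketch rather than the actual VisionLLM implementation: VisionTokens, toy_image_tokenizer, and ToyOpenEndedDecoder are hypothetical stand-ins. The sketch only illustrates the idea that image features are treated as a sequence of "foreign language" tokens, that the task is specified by a free-form language instruction, and that a single decoder can serve different tasks (detection, captioning) by emitting the answer as text, with boxes serialized as coordinate tokens.

# Minimal, dependency-free sketch of an instruction-driven vision interface.
# All names here are hypothetical illustrations, not the VisionLLM code.

from dataclasses import dataclass
from typing import List


@dataclass
class VisionTokens:
    """Image features treated as a 'foreign language': a sequence of tokens."""
    tokens: List[str]


def toy_image_tokenizer(image_id: str, num_tokens: int = 4) -> VisionTokens:
    # Stand-in for a vision backbone plus image tokenizer; we fabricate
    # placeholder tokens so the example runs end to end.
    return VisionTokens(tokens=[f"<img_{image_id}_{i}>" for i in range(num_tokens)])


class ToyOpenEndedDecoder:
    """Stand-in for the LLM-based decoder: it reads vision tokens plus a
    free-form language instruction and returns the answer as plain text."""

    def generate(self, vision: VisionTokens, instruction: str) -> str:
        # Compose the prompt the way an LLM decoder would see it:
        # vision tokens followed by the language instruction.
        prompt = " ".join(vision.tokens) + " " + instruction
        # A real decoder would generate autoregressively; here we branch on
        # the instruction to show how one model serves different tasks.
        if "bounding box" in instruction.lower():
            return "person <box>(120, 54), (361, 430)</box>"
        if "caption" in instruction.lower():
            return "A person standing next to a bicycle."
        return f"[no canned answer for prompt: {prompt!r}]"


if __name__ == "__main__":
    decoder = ToyOpenEndedDecoder()
    vision = toy_image_tokenizer("0001")
    # Task-level customization: the same model, different instructions.
    print(decoder.generate(vision, "Detect all persons and output each bounding box."))
    print(decoder.generate(vision, "Write a short caption for the image."))

In this toy setup, switching between detection and captioning requires no change to the model interface, only a different instruction string, which is the flexibility the abstract attributes to defining vision-centric tasks through language.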
