VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Jiannan Wu, Ping Luo, Jifeng Dai, Y. Qiao, Xizhou Zhu, Tong Lu, Xiaokang Chen, Zhe Chen, Wen Wang, Jie Zhou, Gang Zeng