Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
Siliang Tang, Wei Ji, Zhiqi Ge, Wenqiao Zhang, Kaihang Pan, Minghe Gao, Tat-Seng Chua, Juncheng Li, Hanwang Zhang, Yueting Zhuang