Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Multimodal Large Language Models (MLLMs) have recently attracted significant interest and demonstrate emergent capabilities as general-purpose models for a wide range of vision-language tasks. However, existing methods mainly focus on a limited set of instruction types with a single image as visual context, which hinders the broader applicability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential contexts and cover a diverse range of scenarios (e.g., visually rich webpages and textbooks, lecture slides, embodied dialogue). Systematic evaluation on the I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG), trained with an image-captioning alignment objective, tends to attend to the common foreground information useful for captioning but struggles to extract the specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which leverages the sophisticated reasoning ability of LLMs to control the VPG so that it conditionally extracts instruction-specific visual information and re-injects it into the LLM. Furthermore, we introduce an annotation-free, cross-attention-guided counterfactual image training strategy that methodically trains the proposed module through the collaboration of a cascade of foundation models. Equipped with the proposed module and training strategy, we present Cheetah, an MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction-tuning data. Moreover, Cheetah exhibits competitive performance compared with state-of-the-art instruction-tuned models on the concurrent MME benchmark.
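
To make the idea of controllable knowledge re-injection concrete, the following is a minimal sketch (not the authors' code) of how an instruction-conditioned, Q-Former-style VPG could extract instruction-specific visual tokens and hand them back to the LLM. The module names, dimensions, and the additive conditioning scheme are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical sketch of a controllable knowledge re-injection module:
# learnable query tokens are modulated by an instruction representation
# from the LLM, cross-attend to frozen image features, and the resulting
# visual tokens are re-injected into the LLM's input sequence.
import torch
import torch.nn as nn


class ControllableVPG(nn.Module):
    def __init__(self, d_model=768, n_queries=32, n_heads=12):
        super().__init__()
        # Learnable query tokens, as in a Q-Former-style VPG.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        # Projects the LLM's instruction state into a control signal
        # that conditions the queries (assumed conditioning scheme).
        self.control_proj = nn.Linear(d_model, d_model)
        # Cross-attention from conditioned queries to image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Maps extracted tokens into the LLM embedding space.
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, image_feats, instruction_state):
        # image_feats: (B, N_patches, d_model) from a frozen vision encoder
        # instruction_state: (B, d_model) pooled LLM state for the instruction
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Condition the queries on the instruction (additive control signal).
        q = q + self.control_proj(instruction_state).unsqueeze(1)
        visual_tokens, _ = self.cross_attn(q, image_feats, image_feats)
        # These tokens would be re-injected (e.g., prepended) into the LLM input.
        return self.out_proj(visual_tokens)


if __name__ == "__main__":
    vpg = ControllableVPG()
    feats = torch.randn(2, 257, 768)   # e.g., ViT patch features for one image
    instr = torch.randn(2, 768)        # pooled instruction state from the LLM
    print(vpg(feats, instr).shape)     # torch.Size([2, 32, 768])
```

In this sketch, only the lightweight control projection and output projection would need training, which is consistent with the paper's emphasis on a generic and lightweight module; the exact way the instruction signal is obtained from the LLM is an assumption here.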
