What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, both quantitative and qualitative, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute, to the best of our knowledge, the first comprehensive evaluation set covering both image and video tasks, built through crowd-sourcing. Based on our findings, we present Lynx, which achieves the most accurate multi-modal understanding while maintaining the best multi-modal generation ability among existing open-source GPT4-style models.
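To make the "over 20 variants with controlled settings" concrete, the sketch below enumerates ablation configurations along the axes the abstract names (LLM backbone, model design, data sampling, prompt diversity). It is a minimal illustrative example, not the paper's actual code; the field names and option values (e.g., "prefix-tokens", "cross-attention") are assumptions for illustration only.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical configuration axes for a controlled ablation study of a
# GPT4-style multimodal model. The specific options listed here are
# illustrative placeholders, not the settings reported in the paper.

@dataclass(frozen=True)
class VariantConfig:
    llm_backbone: str          # which LLM the visual features feed into
    vision_adapter: str        # how visual tokens are injected into the LLM
    data_mix: str              # training-data composition / sampling strategy
    diversified_prompts: bool  # whether instruction templates are paraphrased

BACKBONES = ["llama-7b", "vicuna-7b"]
ADAPTERS = ["prefix-tokens", "cross-attention"]
DATA_MIXES = ["image-text-only", "image+video", "image+video+text-instructions"]
PROMPTS = [False, True]

def build_variants() -> list[VariantConfig]:
    """Enumerate every combination of the controlled axes."""
    return [
        VariantConfig(b, a, d, p)
        for b, a, d, p in product(BACKBONES, ADAPTERS, DATA_MIXES, PROMPTS)
    ]

if __name__ == "__main__":
    variants = build_variants()
    print(f"{len(variants)} controlled variants")  # 2 * 2 * 3 * 2 = 24
    for v in variants[:3]:
        print(v)
```

In a study of this kind, each variant would be trained under an otherwise fixed recipe so that differences on the image and video benchmarks can be attributed to a single design choice at a time.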
