What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
[1] Yunhang Shen, et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, 2023, arXiv.
[2] Zhongyu Wei, et al. Valley: Video Assistant with Large Language model Enhanced abilitY, 2023, arXiv.
[3] Salman Khan, et al. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, 2023, arXiv.
[4] Han Zhang, et al. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding, 2023, arXiv.
[5] Jianlong Fu, et al. AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation, 2023, arXiv.
[6] Yali Wang, et al. VideoLLM: Modeling Video Sequence with Large Language Models, 2023, arXiv.
[7] Jiannan Wu, et al. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, 2023, NeurIPS.
[8] Andrew M. Dai, et al. PaLM 2 Technical Report, 2023, arXiv.
[9] Boyang Li, et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023, NeurIPS.
[10] Jifeng Dai, et al. InternGPT: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language, 2023, arXiv.
[11] Kai Chen, et al. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans, 2023, arXiv.
[12] Yuanhan Zhang, et al. Otter: A Multi-Modal Model with In-Context Instruction Tuning, 2023, arXiv.
[13] Hongsheng Li, et al. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, 2023, arXiv.
[14] Ming Yan, et al. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, 2023, arXiv.
[15] Pang Wei Koh, et al. DataComp: In search of the next generation of multimodal datasets, 2023, arXiv.
[16] Jia-Bin Huang, et al. AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head, 2023, AAAI.
[17] Mohamed Elhoseiny, et al. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, 2023, arXiv.
[18] Yong Jae Lee, et al. Visual Instruction Tuning, 2023, arXiv.
[19] Chunyuan Li, et al. Instruction Tuning with GPT-4, 2023, arXiv.
[20] Julian McAuley, et al. Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data, 2023, EMNLP.
[21] Ledell Yu Wu, et al. EVA-CLIP: Improved Training Techniques for CLIP at Scale, 2023, arXiv.
[22] Faisal Ahmed, et al. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, 2023, arXiv.
[23] Chenfei Wu, et al. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, 2023, arXiv.
[24] Mehdi S. M. Sajjadi, et al. PaLM-E: An Embodied Multimodal Language Model, 2023, ICML.
[25] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, arXiv.
[26] Li Dong, et al. Language Is Not All You Need: Aligning Perception with Language Models, 2023, NeurIPS.
[27] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ICML.
[28] Xi Victoria Lin, et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization, 2022, arXiv.
[29] Noah A. Smith, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions, 2022, ACL.
[30] Ishan Misra, et al. Learning Video Representations from Large Language Models, 2022, CVPR 2023.
[31] Hang Li, et al. X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks, 2022, arXiv.
[32] Ledell Yu Wu, et al. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale, 2022, CVPR 2023.
[33] Alexander M. Rush, et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model, 2022, arXiv.
[34] Andrew M. Dai, et al. Scaling Instruction-Finetuned Language Models, 2022, arXiv.
[35] Ludwig Schmidt, et al. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022, NeurIPS.
[36] Li Fei-Fei, et al. VIMA: General Robot Manipulation with Multimodal Prompts, 2022, arXiv.
[37] P. Zhang, et al. GLM-130B: An Open Bilingual Pre-trained Model, 2022, ICLR.
[38] Ashish V. Thapliyal, et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model, 2022, ICLR.
[39] Peter R. Florence, et al. Inner Monologue: Embodied Reasoning through Planning with Language Models, 2022, CoRL.
[40] Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models, 2022, arXiv.
[41] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[42] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, JMLR.
[43] S. Levine, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, 2022, CoRL.
[44] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, arXiv.
[45] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[46] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[47] Renelito Delos Santos, et al. LaMDA: Language Models for Dialog Applications, 2022, arXiv.
[48] Hang Li, et al. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, 2021, ICML.
[49] Li Dong, et al. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, 2021, NeurIPS.
[50] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.
[51] Angela Yao, et al. NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions, 2021, CVPR.
[52] Zhilin Yang, et al. GLM: General Language Model Pretraining with Autoregressive Blank Infilling, 2021, ACL.
[53] Andrew Zisserman, et al. Perceiver: General Perception with Iterative Attention, 2021, ICML.
[54] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[55] Radu Soricut, et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, 2021, CVPR.
[56] K. Simonyan, et al. High-Performance Large-Scale Image Recognition Without Normalization, 2021, ICML.
[57] Lu Chen, et al. WebSRC: A Dataset for Web-Based Structural Reading Comprehension, 2021, EMNLP.
[58] Charles Foster, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, arXiv.
[59] Olatunji Ruwase, et al. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, 2020, KDD.
[60] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[61] Douwe Kiela, et al. The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, 2020, NeurIPS.
[62] Marcus Rohrbach, et al. TextCaps: a Dataset for Image Captioning with Reading Comprehension, 2020, ECCV.
[63] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[64] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, arXiv.
[65] Jordi Pont-Tuset, et al. Connecting Vision and Language with Localized Narratives, 2019, ECCV.
[66] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.
[67] Shashank Shekhar, et al. OCR-VQA: Visual Question Answering by Reading Text in Images, 2019, ICDAR.
[68] Ernest Valveny, et al. Scene Text Visual Question Answering, 2019, ICCV.
[69] Ali Farhadi, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, 2019, CVPR.
[70] Xinlei Chen, et al. Towards VQA Models That Can Read, 2019, CVPR.
[71] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, CVPR.
[72] Asim Kadav, et al. Visual Entailment: A Novel Task for Fine-Grained Image Understanding, 2019, arXiv.
[73] Yoav Artzi, et al. A Corpus for Reasoning about Natural Language Grounded in Photographs, 2018, ACL.
[74] Ibrahim Alper Dogru, et al. Human Activity Recognition Using Smartphones, 2018, ISMSIT.
[75] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.
[76] Bolei Zhou, et al. Places: A 10 Million Image Database for Scene Recognition, 2018, IEEE TPAMI.
[77] Yueting Zhuang, et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion, 2017, ACM Multimedia.
[78] Kiyoshi Tanaka, et al. Improved ArtGAN for Conditional Synthesis of Natural Image and Artwork, 2017, IEEE TIP.
[79] Susanne Westphal, et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, 2017, ICCV.
[80] Christopher Kanan, et al. An Analysis of Visual Question Answering Algorithms, 2017, ICCV.
[81] Xiaogang Wang, et al. Person Search with Natural Language Description, 2017, CVPR.
[82] Kenji Doya, et al. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning, 2017, Neural Networks.
[83] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, IJCV.
[84] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, CVPR.
[85] Yash Goyal, et al. Yin and Yang: Balancing and Answering Binary Visual Questions, 2015, CVPR 2016.
[86] Michael S. Bernstein, et al. Visual7W: Grounded Question Answering in Images, 2015, CVPR 2016.
[87] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, IJCV.
[88] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, arXiv.
[89] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.
[90] Vicente Ordonez, et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.
[91] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.
[92] Rob Miller, et al. VizWiz: nearly real-time answers to visual questions, 2010, UIST.
[93] Cyrus Rashtchian, et al. Collecting Image Annotations Using Amazon’s Mechanical Turk, 2010, Mturk@HLT-NAACL.
[94] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[95] Xu Tan, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, 2023, NeurIPS.
[96] Lisa Anne Hendricks, et al. An empirical analysis of compute-optimal large language model training, 2022, NeurIPS.
[97] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[98] Paul Clough, et al. The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems, 2006.