MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu | Zhengyuan Yang | Linjie Li | Jianfeng Wang | Kevin Lin | Zicheng Liu | Xinchao Wang | Lijuan Wang
[1] Yuying Ge, et al. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, 2023, arXiv.
[2] Eric Michael Smith, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023, arXiv.
[3] Dahua Lin, et al. MMBench: Is Your Multi-modal Model an All-around Player?, 2023, arXiv.
[4] Yan Zeng, et al. What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?, 2023, arXiv.
[5] Yunhang Shen, et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, 2023, arXiv.
[6] Wenqi Shao, et al. LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models, 2023, arXiv.
[7] Mike Zheng Shou, et al. AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn, 2023, arXiv.
[8] E. Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023, arXiv.
[9] C. Li, et al. MIMIC-IT: Multi-Modal In-Context Instruction Tuning, 2023, arXiv.
[10] Jiannan Wu, et al. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, 2023, NeurIPS.
[11] Andrew M. Dai, et al. PaLM 2 Technical Report, 2023, arXiv.
[12] Chunyuan Li, et al. On the Hidden Mystery of OCR in Large Multimodal Models, 2023, arXiv.
[13] Boyang Li, et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023, NeurIPS.
[14] Kai Chen, et al. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans, 2023, arXiv.
[15] Yuanhan Zhang, et al. Otter: A Multi-Modal Model with In-Context Instruction Tuning, 2023, arXiv.
[16] Hung-yi Lee, et al. Can Large Language Models Be an Alternative to Human Evaluations?, 2023, ACL.
[17] Hongsheng Li, et al. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, 2023, arXiv.
[18] Ming Yan, et al. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, 2023, arXiv.
[19] Mohamed Elhoseiny, et al. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, 2023, arXiv.
[20] Yong Jae Lee, et al. Visual Instruction Tuning, 2023, arXiv.
[21] William Yang Wang, et al. Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text, 2023, NeurIPS.
[22] Chunyuan Li, et al. Instruction Tuning with GPT-4, 2023, arXiv.
[23] Dan Iter, et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, 2023, EMNLP.
[24] Marco Tulio Ribeiro, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4, 2023, arXiv.
[25] Faisal Ahmed, et al. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, 2023, arXiv.
[26] Mehdi S. M. Sajjadi, et al. PaLM-E: An Embodied Multimodal Language Model, 2023, ICML.
[27] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, arXiv.
[28] Pengfei Liu, et al. GPTScore: Evaluate as You Desire, 2023, NAACL.
[29] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ICML.
[30] Jong Wook Kim, et al. Robust Speech Recognition via Large-Scale Weak Supervision, 2022, ICML.
[31] Jamie Callan, et al. PAL: Program-aided Language Models, 2022, ICML.
[32] Ledell Yu Wu, et al. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale, 2022, CVPR.
[33] Andrew M. Dai, et al. Scaling Instruction-Finetuned Language Models, 2022, arXiv.
[34] Jianfeng Gao, et al. Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, 2022, Found. Trends Comput. Graph. Vis.
[35] Shannon L. Spruit, et al. No Language Left Behind: Scaling Human-Centered Machine Translation, 2022, arXiv.
[36] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, Trans. Mach. Learn. Res.
[37] Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models, 2022, arXiv.
[38] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[39] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[40] Adrian S. Wong, et al. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, 2022, ICLR.
[41] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, arXiv.
[42] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[43] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[44] Alexander S. Ecker, et al. Image Segmentation Using Text and Image Prompts, 2021, CVPR.
[45] Dongyoon Han, et al. OCR-Free Document Understanding Transformer, 2021, ECCV.
[46] Faisal Ahmed, et al. UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling, 2021, ECCV.
[47] Jenia Jitsev, et al. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, 2021, arXiv.
[48] Rui Wang, et al. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing, 2021, ACL.
[49] David J. Fleet, et al. Pix2seq: A Language Modeling Framework for Object Detection, 2021, ICLR.
[50] Zhe Gan, et al. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, 2021, AAAI.
[51] Adams Wei Yu, et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, ICLR.
[52] Oriol Vinyals, et al. Multimodal Few-Shot Learning with Frozen Language Models, 2021, NeurIPS.
[53] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[54] Radu Soricut, et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, 2021, CVPR.
[55] Wonjae Kim, et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, 2021, ICML.
[56] Jiebo Luo, et al. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, 2020, CVPR.
[57] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[58] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[59] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[60] Marcus Rohrbach, et al. TextCaps: a Dataset for Image Captioning with Reading Comprehension, 2020, ECCV.
[61] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[62] Ahmed El Kholy, et al. UNITER: Learning UNiversal Image-TExt Representations, 2019, ECCV.
[63] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[64] Ali Farhadi, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, 2019, CVPR.
[65] Xinlei Chen, et al. Towards VQA Models That Can Read, 2019, CVPR.
[66] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, CVPR.
[67] Ali Farhadi, et al. From Recognition to Cognition: Visual Commonsense Reasoning, 2018, CVPR.
[68] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.
[69] Ronald M. Summers, et al. ChestX-ray: Hospital-Scale Chest X-ray Database and Benchmarks on Weakly Supervised Classification and Localization of Common Thorax Diseases, 2019, Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics.
[70] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2016, International Journal of Computer Vision.
[71] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[72] Margaret Mitchell, et al. VQA: Visual Question Answering, 2015, International Journal of Computer Vision.
[73] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, arXiv.
[74] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[75] Vicente Ordonez, et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.
[76] Xu Tan, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, 2023, NeurIPS.
[77] Xinlei Chen, et al. nocaps: novel object captioning at scale, 2019, ICCV.
[78] Marshall Copeland, et al. Microsoft Azure, 2015, Apress.