M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages across three educational levels, where about 23\% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly on complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code are available at \url{https://github.com/DAMO-NLP-SG/M3Exam}.
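
To make the evaluation setup concrete, the sketch below shows how accuracy on a benchmark of this shape might be computed per language, given a model's answer-selection function. The JSON layout and field names (question_text, options, answer_text, language, level, need_image) are illustrative assumptions rather than the released schema; the repository above contains the actual data format and official evaluation code.

    import json
    from collections import defaultdict

    def load_questions(path):
        # Load a list of multiple-choice exam questions from a JSON file.
        # Field names used below are assumed for illustration; consult the
        # released M3Exam data for the real schema.
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def evaluate(questions, predict):
        # predict(question) should return the chosen option label,
        # e.g. "A"/"B"/"C"/"D", for a given question dict.
        correct, total = defaultdict(int), defaultdict(int)
        for q in questions:
            lang = q["language"]
            total[lang] += 1
            if predict(q) == q["answer_text"]:
                correct[lang] += 1
        return {lang: correct[lang] / total[lang] for lang in total}

Reporting per-language accuracy rather than a single pooled score matches the multilingual framing of the benchmark, since an aggregate number can hide weak performance on low-resource or non-Latin script languages.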
