M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
[1] Liang He,et al. Evaluating the Performance of Large Language Models on GAOKAO Benchmark , 2023, ArXiv.
[2] Andrew M. Dai,et al. PaLM 2 Technical Report , 2023, ArXiv.
[3] Maosong Sun,et al. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models , 2023, NeurIPS.
[4] Boyang Li,et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , 2023, NeurIPS.
[5] Weizhu Chen,et al. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models , 2023, ArXiv.
[6] Amir Pouran Ben Veyseh,et al. ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning , 2023, EMNLP.
[7] Dragomir R. Radev,et al. Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations , 2023, ArXiv.
[8] P. Kambadur,et al. BloombergGPT: A Large Language Model for Finance , 2023, ArXiv.
[9] You Zhang,et al. ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge , 2023, ArXiv.
[10] Marco Tulio Ribeiro,et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4 , 2023, ArXiv.
[11] Sunayana Sitaram,et al. MEGA: Multilingual Evaluation of Generative AI , 2023, ArXiv.
[12] Henrique Pondé de Oliveira Pinto,et al. GPT-4 Technical Report , 2023, ArXiv.
[13] Naman Goyal,et al. LLaMA: Open and Efficient Foundation Language Models , 2023, ArXiv.
[14] Dan Su,et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity , 2023, IJCNLP.
[15] Jing Yu Koh,et al. Grounding Language Models to Images for Multimodal Inputs and Outputs , 2023, ICML.
[16] S. Savarese,et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , 2023, ICML.
[17] Alexander M. Rush,et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model , 2022, ArXiv.
[18] Yuhuai Wu,et al. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs , 2022, ICLR.
[19] Andrew M. Dai,et al. Scaling Instruction-Finetuned Language Models , 2022, ArXiv.
[20] Yuhuai Wu,et al. Solving Quantitative Reasoning Problems with Language Models , 2022, NeurIPS.
[21] Ryan J. Lowe,et al. Training language models to follow instructions with human feedback , 2022, NeurIPS.
[22] Nigel Collier,et al. Visually Grounded Reasoning across Languages and Cultures , 2021, EMNLP.
[23] Jinlan Fu,et al. XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation , 2021, EMNLP.
[24] Dawn Song,et al. Measuring Massive Multitask Language Understanding , 2020, ICLR.
[25] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[26] A. Korhonen,et al. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning , 2020, EMNLP.
[27] Orhan Firat,et al. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization , 2020, ICML.
[28] Omer Levy,et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , 2019, NeurIPS.
[29] Christopher D. Manning,et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , 2019, CVPR.
[30] Xinlei Chen,et al. nocaps: novel object captioning at scale , 2019, ICCV.
[31] Guillaume Lample,et al. XNLI: Evaluating Cross-lingual Sentence Representations , 2018, EMNLP.
[32] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.
[33] Yueting Zhuang,et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion , 2017, ACM Multimedia.
[34] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2016, IJCV.
[35] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.
[36] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, IJCV.
[37] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[38] Philipp Koehn,et al. Findings of the 2014 Workshop on Statistical Machine Translation , 2014, WMT@ACL.