RoCar: A Relationship Network-based Evaluation Method to Large Language Models
Daling Wang | Yifei Zhang | Shi Feng | Ming Wang | Wenfang Wu | Chongyun Gao