CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark

Realizing general-purpose language intelligence has been a longstanding goal of natural language processing, where standard evaluation benchmarks play a fundamental and guiding role. We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic. To this end, we propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) a hierarchical benchmark framework, where datasets are selected and organized in a principled way along a language capability-task-dataset hierarchy; (2) a multi-level scoring strategy, where model performance is reported at each level of the hierarchical framework. To facilitate CUGE, we provide a public leaderboard that can be customized to support flexible model-judging criteria. Evaluation results on representative pre-trained language models indicate ample room for improvement towards general-purpose language intelligence. CUGE is publicly available at cuge.baai.ac.cn.
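The multi-level scoring strategy can be pictured as rolling dataset-level scores up through the capability-task-dataset hierarchy. The abstract does not give CUGE's exact aggregation formula, so the following is a minimal sketch assuming simple averaging at every level; the dataset names and score values are hypothetical examples, not benchmark results.

```python
# Hypothetical sketch of multi-level scoring over a
# capability -> task -> dataset hierarchy. The per-level averaging
# is an assumption, not CUGE's published aggregation scheme.
from statistics import mean

# Example hierarchy with made-up normalized scores in [0, 1].
scores = {
    "understanding": {
        "word_segmentation": {"dataset_a": 0.92, "dataset_b": 0.95},
        "reading_comprehension": {"dataset_c": 0.68},
    },
    "generation": {
        "summarization": {"dataset_d": 0.41},
    },
}

def task_score(datasets):
    """Task-level score: mean of the task's dataset-level scores."""
    return mean(datasets.values())

def capability_score(tasks):
    """Capability-level score: mean of the capability's task-level scores."""
    return mean(task_score(d) for d in tasks.values())

def overall_score(hierarchy):
    """Overall score: mean of the capability-level scores."""
    return mean(capability_score(t) for t in hierarchy.values())

for capability, tasks in scores.items():
    print(capability, round(capability_score(tasks), 4))
print("overall", round(overall_score(scores), 4))
```

A hierarchy like this lets a leaderboard report performance at whichever granularity a user cares about (a single overall number, per-capability scores, or per-task scores), which is what makes the judging criteria customizable.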
