RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, Russian SuperGLUE. Recent advances in universal language models and transformers require a methodology for their broad diagnostics and for testing general intellectual skills: natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, has been developed from scratch for the Russian language. We also provide baselines, a human-level evaluation, an open-source framework for evaluating models, and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the translated diagnostic test set, and we take the first steps toward further expansion and assessment of state-of-the-art models independently of language.
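To make the evaluation setup concrete, below is a minimal sketch of scoring a trivial majority-class baseline on one of the benchmark tasks. It is not the paper's official evaluation framework: it assumes the benchmark is mirrored on the Hugging Face Hub under the dataset id russian_super_glue with a terra (textual entailment) config, and both identifiers are assumptions introduced here for illustration.

```python
# A minimal sketch, NOT the official RussianSuperGLUE evaluation framework.
# Assumes the benchmark is mirrored on the Hugging Face Hub as
# "russian_super_glue" with a "terra" (textual entailment) config;
# both identifiers are assumptions made for this illustration.
from collections import Counter

from datasets import load_dataset
from sklearn.metrics import accuracy_score

task = load_dataset("russian_super_glue", "terra")  # assumed dataset id
train, val = task["train"], task["validation"]

# Majority-class baseline: always predict the most frequent training label.
majority_label = Counter(train["label"]).most_common(1)[0][0]
predictions = [majority_label] * len(val)

print(f"majority-class accuracy: {accuracy_score(val['label'], predictions):.3f}")
```

Such a baseline gives a floor against which the reported transformer baselines and the human-level scores on the leaderboard can be compared.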
