KLUE: Korean Language Understanding Evaluation

We introduce the Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of eight Korean natural language understanding (NLU) tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, so that anyone can access the data without restriction. With ethical considerations in mind, we carefully design the annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLMs), KLUE-BERT and KLUE-RoBERTa, to help reproduce the baseline models on KLUE and thereby facilitate future research. Preliminary experiments on the proposed benchmark yield several interesting observations that already demonstrate its usefulness. First, we find that KLUE-RoBERTa-large outperforms the other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we remove personally identifiable information from the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that BPE tokenization combined with morpheme-level pre-tokenization is effective for tasks involving morpheme-level tagging, detection, and generation. In addition to accelerating Korean NLP research, our comprehensive documentation of how KLUE was created will facilitate building similar resources for other languages in the future. KLUE is available at https://klue-benchmark.com/.
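
As a concrete illustration of the fine-tuning recipes mentioned above, the following minimal sketch fine-tunes a KLUE-RoBERTa baseline on the Topic Classification (YNAT) task. It is not the authors' exact recipe; it assumes the KLUE data and checkpoints are available on the Hugging Face Hub under the names "klue" and "klue/roberta-base", and should be adapted to the officially released configurations.

# Minimal fine-tuning sketch for a KLUE Topic Classification (YNAT) baseline.
# Assumes the "klue" dataset and "klue/roberta-base" checkpoint on the Hugging Face Hub.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("klue", "ynat")              # train / validation splits
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")

def tokenize(batch):
    # YNAT classifies news headlines, stored in the "title" field.
    return tokenizer(batch["title"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-base", num_labels=7)              # YNAT has 7 topic labels

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="klue-ynat-baseline",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()

The tokenization finding can be sketched in a similar spirit: the snippet below pre-tokenizes Korean text into morphemes with MeCab (through KoNLPy) before training a BPE vocabulary, which is one way to realize morpheme-level pre-tokenization combined with BPE. The MeCab installation and the Hugging Face tokenizers library are assumptions for illustration, not the paper's released tooling.

# Sketch of morpheme-level pre-tokenization followed by BPE training.
# Assumes MeCab-ko is installed and reachable through KoNLPy.
from konlpy.tag import Mecab
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

mecab = Mecab()

def pre_tokenize(line: str) -> str:
    # Split the raw sentence into morphemes before BPE sees it.
    return " ".join(mecab.morphs(line))

corpus = ["클루는 한국어 이해 평가 벤치마크입니다."]   # stand-in for a pretraining corpus
pre_tokenized = [pre_tokenize(s) for s in corpus]

bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
bpe.train_from_iterator(pre_tokenized, trainer=trainer)

print(bpe.encode(pre_tokenize("한국어 자연어 처리")).tokens)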
