Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean

In this paper, we first raise important issues regarding the Penn Korean Universal Dependency Treebank (PKT-UD) and address them by manually revising the entire corpus, with the aim of producing cleaner UD annotations that are more faithful to Korean grammar. For compatibility with the rest of the UD corpora, we follow the UDv2 guidelines and extensively revise the part-of-speech tags and dependency relations to reflect the morphological features and flexible word order of Korean. The original and revised versions of PKT-UD are evaluated with transformer-based parsing models using biaffine attention. The parsing model trained on the revised corpus shows a significant improvement of 3.0% in labeled attachment score over the model trained on the previous corpus. Our error analysis demonstrates that this revision allows the parsing model to learn relations more robustly, reducing several critical errors made by the previous model.
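The comparison above is reported in labeled attachment score (LAS), alongside its unlabeled counterpart (UAS). As a minimal sketch of how these metrics are computed, the function below scores a predicted dependency tree against a gold tree; the representation of each token as a `(head, deprel)` pair is an assumption for illustration, not the paper's actual evaluation code.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for a sentence.

    gold, pred: lists of (head_index, deprel) tuples, one per token.
    UAS counts tokens whose head is correct; LAS additionally
    requires the dependency label to match.
    """
    assert len(gold) == len(pred), "trees must cover the same tokens"
    uas_hits = las_hits = 0
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    n = len(gold)
    return uas_hits / n, las_hits / n

# Hypothetical 4-token sentence: one label error (obj vs. obl)
# and one head error on the final token.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
# uas == 0.75, las == 0.5
```

In corpus-level evaluation, hits and token counts would be accumulated over all sentences before dividing, rather than averaging per-sentence scores.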
