Estimating POS Annotation Consistency of Different Treebanks in a Language

We introduce a new symmetric measure, θpos, built on the asymmetric KLcpos3 measure (Rosa and Žabokrtský, 2015), for comparing the annotation consistency of different treebanks of a given language that are annotated under the same guidelines. A threshold can be set for the measure so that a pair of treebanks is considered harmonious in its annotation if θpos does not exceed the threshold. To calculate the threshold, we estimate the effects of (i) size variation and (ii) genre variation in the pair of treebanks under consideration. The estimates are based on data from treebanks of distinct language families, which makes the threshold less dependent on the properties of individual languages. We demonstrate the utility of the proposed measure by listing the treebanks in Universal Dependencies version 2.5 (UDv2.5) (Zeman et al., 2019) that are annotated consistently with other treebanks of the same language. The measure could also be used to assess inter-treebank annotation consistency under other (non-UD) annotation guidelines.
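
To make the idea concrete, the following is a minimal sketch of how a KLcpos3-style divergence between two treebanks could be computed and then symmetrised. It is not the definition of θpos, which the abstract does not spell out. It assumes that KLcpos3 is the Kullback-Leibler divergence between POS-tag trigram distributions, with trigrams unseen in the source given a count of 1, and it symmetrises by summing both directions, which is purely an illustrative choice. The function names (trigram_counts, klcpos3, symmetric_klcpos3) and the toy tag sequences are hypothetical.

```python
import math
from collections import Counter


def trigram_counts(sentences):
    """Count POS-tag trigrams in a treebank given as a list of tag sequences."""
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


def klcpos3(tgt_counts, src_counts):
    """KL divergence of the target trigram distribution from the source one.

    Assumption: a target trigram unseen in the source gets a source count
    of 1, so that the divergence stays finite.
    """
    smoothed = dict(src_counts)
    for trigram in tgt_counts:
        smoothed.setdefault(trigram, 1)
    tgt_total = sum(tgt_counts.values())
    src_total = sum(smoothed.values())
    divergence = 0.0
    for trigram, count in tgt_counts.items():
        p = count / tgt_total
        q = smoothed[trigram] / src_total
        divergence += p * math.log(p / q)
    return divergence


def symmetric_klcpos3(treebank_a, treebank_b):
    """Illustrative symmetrisation: sum of both directions (an assumption)."""
    counts_a = trigram_counts(treebank_a)
    counts_b = trigram_counts(treebank_b)
    return klcpos3(counts_a, counts_b) + klcpos3(counts_b, counts_a)


# Toy treebanks: each sentence is a list of universal POS tags.
treebank_a = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "ADV"]]
treebank_b = [["DET", "ADJ", "NOUN", "VERB"], ["PRON", "VERB", "ADV"]]
score = symmetric_klcpos3(treebank_a, treebank_b)
```

Under this sketch, a pair of treebanks would be called harmonious when the score stays at or below a chosen threshold; how that threshold is derived from size and genre effects is specific to the paper and is not reproduced here.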

[1] Sampo Pyysalo, et al. Universal Dependencies v1: A Multilingual Treebank Collection, 2016, LREC.

[2] Tuomo Kakkonen. Dependency treebanks: methods, annotation schemes and tools, 2005, NODALIDA.

[3] Walt Detmar Meurers, et al. Detecting Inconsistencies in Treebanks, 2003.

[4] Daniel Zeman, et al. Universal Dependencies for the AnCora treebanks, 2016, Proces. del Leng. Natural.

[5] Barbara Plank, et al. Parsing Universal Dependencies without training, 2017, EACL.

[6] Martin Potthast, et al. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2018, CoNLL.

[7] Ondřej Dušek, et al. HamleDT: Harmonized multi-language dependency treebank, 2014, Lang. Resour. Evaluation.

[8] Walt Detmar Meurers, et al. Detecting Errors in Discontinuous Structural Annotation, 2005, ACL.

[9] Rudolf Rosa, et al. KLcpos3 - a Language Similarity Measure for Delexicalized Parser Transfer, 2015, ACL.

[10] Daniel Zeman, et al. Data Conversion and Consistency of Monolingual Corpora: Russian UD Treebanks, 2018.

[11] Na-Rae Han, et al. Building Universal Dependency Treebanks in Korean, 2018, LREC.

[12] Chiara Alzetta, et al. Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order, 2018, LREC.

[13] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.

[14] Yijia Liu, et al. Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation, 2018, CoNLL.

[15] Slav Petrov, et al. A Universal Part-of-Speech Tagset, 2011, LREC.

[16] Sampo Pyysalo, et al. Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection, 2020, LREC.

[17] Walt Detmar Meurers, et al. Detecting Errors in Part-of-Speech Annotation, 2003, EACL.

[18] Zdeněk Žabokrtský, et al. Udapi: Universal API for Universal Dependencies, 2017, UDW@NoDaLiDa.