Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined

Various unsupervised and semi-supervised methods have been proposed to tag and parse an unseen language. We explore delexicalized parsing, proposed by (Zeman and Resnik, 2008), and delexicalized tagging, proposed by (Yu et al., 2016). For both approaches we provide a detailed evaluation on Universal Dependencies data (Nivre et al., 2016), a de-facto standard for multi-lingual morphosyntactic processing (while the previous work used other datasets). Our results confirm that in separation, each of the two delexicalized techniques has some limited potential when no annotation of the target language is available. However, if used in combination, their errors multiply beyond acceptable limits. We demonstrate that even the tiniest bit of expert annotation in the target language may contain significant potential and should be used if available.

[1]  Yuji Matsumoto MaltParser: A language-independent system for data-driven dependency parsing , 2005 .

[2]  Christo Kirov,et al.  Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages , 2016, LREC.

[3]  Rudolf Rosa,et al.  KLcpos3 - a Language Similarity Measure for Delexicalized Parser Transfer , 2015, ACL.

[4]  Zdenek Zabokrtský,et al.  Language Richness of the Web , 2012, LREC.

[5]  Slav Petrov,et al.  Multi-Source Transfer of Delexicalized Dependency Parsers , 2011, EMNLP.

[6]  Pavel Pecina,et al.  Simpler unsupervised POS tagging with bilingual projections , 2013, ACL.

[7]  Philip Resnik,et al.  Cross-Language Parser Adaptation between Related Languages , 2008, IJCNLP.

[8]  Dirk Hovy,et al.  If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages , 2015, ACL.

[9]  Philip Resnik,et al.  Bootstrapping parsers via syntactic projection across parallel texts , 2005, Natural Language Engineering.

[10]  Slav Petrov,et al.  Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections , 2011, ACL.

[11]  Jan Hajic,et al.  UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing , 2016, LREC.

[12]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[13]  Sampo Pyysalo,et al.  Universal Dependencies v1: A Multilingual Treebank Collection , 2016, LREC.

[14]  Slav Petrov,et al.  A Universal Part-of-Speech Tagset , 2011, LREC.

[15]  Daniel Zeman,et al.  If You Even Don’t Have a Bit of Bible: Learning Delexicalized POS Taggers , 2016, LREC.

[16]  David Yarowsky,et al.  Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora , 2001, NAACL.

[17]  Mark Steedman,et al.  Two Decades of Unsupervised POS Induction: How Far Have We Come? , 2010, EMNLP.

[18]  David Yarowsky,et al.  Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day , 2002, CoNLL.

[19]  Steven P. Abney,et al.  Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora , 2005, IJCNLP.

[20]  Sarah L. Nesbeitt Ethnologue: Languages of the World , 1999 .