Diachronic proximity vs. data sparsity in cross-lingual parser projection. A case study on Germanic

For the study of historical language varieties, the sparsity of training data poses serious problems for syntactic annotation and for the development of NLP tools that automate the process. In this paper, we explore strategies to compensate for the lack of training data by including data from related varieties in a series of annotation projection experiments from English to four old Germanic languages: on dependency syntax projected from English to one or more languages, we train a fragment-aware parser and apply it to the target language. For parser training, we take small datasets from the target language as a baseline and compare this baseline with models trained on larger datasets from multiple varieties with different degrees of relatedness, thereby balancing sparsity and diachronic proximity. Our experiments show (a) that adding related-language data to training data in the target language can improve parsing performance,
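To illustrate the general idea of dependency projection described above, the following is a minimal sketch, not the procedure used in the paper. It assumes a one-to-one word alignment given as a dictionary from source to target token indices; the function name `project_heads` and the index conventions (`-1` for the root, `None` for a token with no projected head) are hypothetical choices made for this example. Target tokens without an aligned counterpart receive no head, so the output is in general a set of tree fragments, which is why a fragment-aware parser is needed for training on such data.

```python
from typing import Dict, List, Optional


def project_heads(
    src_heads: List[int],
    alignment: Dict[int, int],
    tgt_len: int,
) -> List[Optional[int]]:
    """Project dependency heads from a source sentence onto a target sentence.

    src_heads[i] is the index of the head of source token i (-1 for the root).
    alignment maps a source token index to a target token index
    (one-to-one alignment is an assumption of this sketch).
    Returns, for each target token, the projected head index, -1 for the
    root, or None if no head could be projected.
    """
    tgt_heads: List[Optional[int]] = [None] * tgt_len
    for src_dep, tgt_dep in alignment.items():
        src_head = src_heads[src_dep]
        if src_head == -1:
            # The source root maps to the target root.
            tgt_heads[tgt_dep] = -1
        elif src_head in alignment:
            # Both dependent and head are aligned: copy the edge.
            tgt_heads[tgt_dep] = alignment[src_head]
        # Otherwise the head is unaligned and the token stays unattached.
    return tgt_heads


# Source: three tokens, token 0 -> head 1, token 1 -> head 2, token 2 is root.
# Target: three tokens, of which token 1 has no aligned source token.
projected = project_heads([1, 2, -1], {1: 0, 2: 2}, 3)
print(projected)
```

In this toy input, the unaligned target token is left without a head, so the projected structure is a partial tree rather than a complete one.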
