Multi-source synthetic treebank creation for improved cross-lingual dependency parsing

This paper describes a method for creating synthetic treebanks for cross-lingual dependency parsing that combines machine translation (including pivot translation), annotation projection and the spanning tree algorithm. Sentences in a lesser-resourced language are first automatically translated into several related, highly-resourced languages and parsed; the annotations are then projected back onto the original sentences, yielding multiple candidate trees for each sentence in the lesser-resourced language. The final treebank is created by merging the candidate trees for each sentence into a graph and running the spanning tree algorithm over it to vote for the best tree. We present experiments aimed at parsing Faroese using a combination of Danish, Swedish and Norwegian. In an experimental setup similar to that of the CoNLL 2018 shared task on dependency parsing, we report state-of-the-art results for Faroese dependency parsing using an off-the-shelf parser.
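
The voting step can be illustrated with a minimal sketch. It assumes each projected tree is a list of head indices (0 for the artificial root), that every candidate tree casts one unweighted vote per arc, and that the best tree is the maximum-weight spanning tree found with the Chu-Liu/Edmonds procedure. The encoding, the scoring and the function names (`vote_edges`, `find_cycle`, `max_spanning_tree`, `merge_trees`) are assumptions made for illustration and are not taken from the paper; the sketch also does not enforce a single root and expects every candidate to be a well-formed tree.

```python
from collections import Counter


def vote_edges(trees):
    """Count how often each (head, dependent) arc occurs across candidate trees.

    Each tree is a list of heads: tree[i] is the head of token i + 1,
    with 0 standing for the artificial root (an assumed encoding).
    """
    votes = Counter()
    for heads in trees:
        for dep, head in enumerate(heads, start=1):
            votes[(head, dep)] += 1
    return votes


def find_cycle(head_of):
    """Return a list of nodes forming a cycle in dep -> head links, or None."""
    for start in head_of:
        path, seen, node = [], set(), start
        while node in head_of and node not in seen:
            seen.add(node)
            path.append(node)
            node = head_of[node]
        if node in seen:
            return path[path.index(node):]
    return None


def max_spanning_tree(scores, root=0):
    """Chu-Liu/Edmonds: scores maps (head, dep) -> weight; returns dep -> head."""
    # Greedy step: keep the best-scoring incoming arc for every non-root node.
    best = {}
    for (head, dep), w in scores.items():
        if dep == root or head == dep:
            continue
        if dep not in best or w > scores[(best[dep], dep)]:
            best[dep] = head
    cycle = find_cycle(best)
    if cycle is None:
        return best

    # Contract the cycle into a fresh node and rescore arcs that touch it.
    in_cycle = set(cycle)
    merged = max(max(arc) for arc in scores) + 1
    new_scores, origin = {}, {}
    for (head, dep), w in scores.items():
        if dep == root or head == dep:
            continue
        if head in in_cycle and dep in in_cycle:
            continue
        if dep in in_cycle:
            # An arc entering the cycle displaces the cycle arc into dep.
            key, val = (head, merged), w - scores[(best[dep], dep)]
        elif head in in_cycle:
            key, val = (merged, dep), w
        else:
            key, val = (head, dep), w
        if key not in new_scores or val > new_scores[key]:
            new_scores[key], origin[key] = val, (head, dep)

    # Solve the smaller problem, then expand the contracted node again.
    result = {}
    for dep, head in max_spanning_tree(new_scores, root).items():
        orig_head, orig_dep = origin[(head, dep)]
        result[orig_dep] = orig_head
    for node in cycle:
        result.setdefault(node, best[node])
    return result


def merge_trees(trees):
    """Vote over candidate trees and return one head list for the sentence."""
    heads = max_spanning_tree(vote_edges(trees))
    return [heads[i] for i in range(1, len(trees[0]) + 1)]
```

For example, given three candidate trees for a three-token sentence, `merge_trees([[2, 0, 2], [2, 0, 2], [0, 1, 2]])` returns `[2, 0, 2]`. Picking the most frequent head for each token independently can produce cycles when the candidates disagree, which is why a spanning tree search is used here rather than a plain per-token majority vote.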
