One model, two languages: training bilingual parsers with harmonized treebanks

We introduce an approach to training lexicalized parsers on bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that can analyze sentences in either of the learned languages, or even sentences that mix both. We test the approach on the Universal Dependency Treebanks, training with MaltParser and MaltOptimizer. The results show that these bilingual parsers are more than competitive: most combinations preserve accuracy, and some achieve significant improvements over the corresponding monolingual parsers. Preliminary experiments also suggest that the approach is promising for texts with code-switching and when more languages are added.
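The merging step can be illustrated with a minimal sketch. It assumes that the harmonized treebanks are stored in a CoNLL-style format (one token per line, sentences separated by blank lines) and that merging amounts to plain concatenation of the two training sets. Both are assumptions made for illustration, and the function names `read_sentences` and `merge_treebanks` are hypothetical rather than part of MaltParser or MaltOptimizer.

```python
from typing import List


def read_sentences(conll_text: str) -> List[str]:
    """Split CoNLL-style text into sentence blocks separated by blank lines."""
    sentences: List[str] = []
    current: List[str] = []
    for line in conll_text.splitlines():
        if line.strip():
            current.append(line.rstrip())
        elif current:
            # A blank line closes the sentence collected so far.
            sentences.append("\n".join(current))
            current = []
    if current:
        sentences.append("\n".join(current))
    return sentences


def merge_treebanks(*treebanks: str) -> str:
    """Concatenate the sentences of several treebanks into one training corpus.

    Assumption: the treebanks share one annotation scheme (harmonized),
    so no label mapping is needed before concatenation.
    """
    merged: List[str] = []
    for text in treebanks:
        merged.extend(read_sentences(text))
    # Each sentence block is followed by a blank line, as in CoNLL files.
    return "".join(block + "\n\n" for block in merged)


# Toy one-sentence treebanks (illustrative data, not taken from the corpus).
english = (
    "1\tDogs\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)
spanish = (
    "1\tLlueve\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)
bilingual = merge_treebanks(english, spanish)
```

The resulting `bilingual` string could then be written to a file and passed to a parser trainer as a single training set. Because the annotation is shared, the learner sees one label inventory over a mixed vocabulary.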
