Automatically generated parallel treebanks and their exploitability in machine translation

Recent discussion and a shift in the field's focus suggest that incorporating syntax is the way forward for improving on the current state of the art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. Until recently, however, they could only be built by hand. In this paper, we describe how we use new tools to build a large parallel treebank automatically and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. We then describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, and investigate the conditions under which incorporating parallel treebank data performs best. Finally, we discuss the possibility of exploiting automatically generated parallel treebanks further in syntax-aware paradigms of MT.
