Large aligned treebanks for syntax-based machine translation

Abstract

We present a collection of parallel treebanks that have been automatically aligned at both the terminal and the non-terminal constituent levels for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present non-terminal alignment evaluation scores for a variety of tree alignment approaches. Finally, based on the parallel treebanks created by these approaches, we evaluate the MT system itself and compare its scores with those of Moses, a current state-of-the-art statistical MT system, trained on the same data.
