Capturing translational divergences with a statistical tree-to-tree aligner

Parallel treebanks, which comprise paired source-target parse trees aligned at the sub-sentential level, could be useful for many applications, particularly data-driven machine translation. In this paper, we focus on how translational divergences are captured within a parallel treebank using a fully automatic statistical tree-to-tree aligner. We observe that while the algorithm performs well at the phrase level, its performance on lexical-level alignments is compromised by an inappropriate bias towards coverage rather than precision. Expressing translational divergences through tree-alignment calls for high precision rather than broad coverage, a preference that stands in direct opposition to the situation for SMT word-alignment models. We suggest that this has implications not only for tree-alignment itself but also for the broader area of induction of syntax-aware models for SMT.
