Syntax-based reordering for statistical machine translation

Abstract: In this paper, we develop an approach called syntax-based reordering (SBR) to handle the fundamental problem of word ordering in statistical machine translation (SMT). We propose to alleviate the word order challenge by including morpho-syntactic and statistical information in a pre-translation reordering framework aimed at capturing short- and long-distance word distortion dependencies. We examine the proposed approach from theoretical and experimental points of view, discussing and analyzing its advantages and limitations in comparison with some state-of-the-art reordering methods. In the final part of the paper, we describe the results of applying the syntax-based model to translation tasks with a great need for reordering (Chinese-to-English and Arabic-to-English). The experiments are carried out on standard phrase-based and alternative N-gram-based SMT systems. We first investigate sparse training data scenarios, in which the translation and reordering models are trained on sparse bilingual data. We then scale the method to a large training set and demonstrate that the improvement in translation quality is maintained.
