Using TectoMT as a Preprocessing Tool for Phrase-Based Statistical Machine Translation

We present a systematic comparison of preprocessing techniques for two language pairs: English-Czech and English-Hindi. The two target languages, although both belong to the Indo-European language family, differ significantly in morphology, syntax and word order. We describe how TectoMT, a successful framework for language analysis and generation, can be used as a preprocessor for a phrase-based MT system. We compare the two language pairs and the optimal sets of source-language transformations applied to each. Examples of possible preprocessing steps include: lemmatization; retokenization and compound splitting; removing or adding words that lack counterparts in the other language; reordering phrases to resemble the target word order; and marking syntactic functions. TectoMT, like all other tools and data sets we use, is freely available on the Web.
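To make the listed transformations concrete, the following minimal sketch applies three of them to a single hand-annotated English sentence: removing words without a target-side counterpart (articles), moving the predicate to the end to approximate a verb-final target order such as Hindi's, and emitting lemmas marked with syntactic functions. It is an illustration only and does not use the TectoMT API. The `Token` class, the tag set (`Sb`, `Pred`, `Obj`, `Det`), the function names and the `lemma|function` output format are all assumptions made for this example; in a real setup the lemmas and function labels would come from an analyzer rather than being written by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str   # surface form as it appears in the text
    lemma: str  # base form (hand-annotated here, normally from an analyzer)
    func: str   # syntactic function; hypothetical tag set: Sb, Pred, Obj, Det


# English articles, which have no direct counterpart in Czech or Hindi.
ARTICLES = {"a", "an", "the"}


def drop_articles(tokens: List[Token]) -> List[Token]:
    """Remove source words that lack a counterpart in the target language."""
    return [t for t in tokens if t.lemma.lower() not in ARTICLES]


def move_predicate_last(tokens: List[Token]) -> List[Token]:
    """Toy reordering: place the predicate after all other tokens."""
    rest = [t for t in tokens if t.func != "Pred"]
    preds = [t for t in tokens if t.func == "Pred"]
    return rest + preds


def render(tokens: List[Token], use_lemma: bool = True, mark_func: bool = True) -> str:
    """Serialize tokens as plain text, optionally lemmatized and function-marked."""
    out = []
    for t in tokens:
        word = t.lemma if use_lemma else t.form
        out.append(f"{word}|{t.func}" if mark_func else word)
    return " ".join(out)


def preprocess(tokens: List[Token]) -> str:
    """Chain the transformations: drop articles, reorder, lemmatize and mark."""
    return render(move_predicate_last(drop_articles(tokens)))


# Hand-annotated example sentence (assumed annotation, not tool output).
sentence = [
    Token("The", "the", "Det"),
    Token("boys", "boy", "Sb"),
    Token("read", "read", "Pred"),
    Token("the", "the", "Det"),
    Token("books", "book", "Obj"),
]

if __name__ == "__main__":
    print(render(sentence, use_lemma=False, mark_func=False))
    print(preprocess(sentence))
```

Each transformation is a separate function so that different subsets can be switched on or off per language pair, which mirrors the idea of searching for the best set of source-side transformations for each target language.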
