Improving statistical machine translation by paraphrasing the training data.

Large amounts of training data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side. The new data is created by parsing each sentence and then generating from the result with a precise HPSG-based grammar, which yields sentences with the same meaning but minor variations in lexical choice and word order. In experiments with Japanese and English, we show consistent gains on the Tanaka Corpus and less consistent improvement on the IWSLT 2005 evaluation data.
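
The expansion step can be illustrated with a minimal sketch. It assumes a paraphraser is available as a plain function. The names `expand_corpus`, `toy_paraphrase`, and `max_paraphrases` are hypothetical and do not come from the paper, and the toy lookup table only stands in for parsing and generating with an HPSG grammar. Each paraphrase of the paraphrased side is paired with the unchanged translation of the original sentence.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (side to paraphrase, other side)


def expand_corpus(
    pairs: Iterable[SentencePair],
    paraphrase: Callable[[str], List[str]],
    max_paraphrases: int = 2,  # hypothetical cap, not a value from the paper
) -> List[SentencePair]:
    """Add (paraphrase, translation) pairs next to each original pair."""
    expanded: List[SentencePair] = []
    for source, target in pairs:
        expanded.append((source, target))  # always keep the original pair
        seen = {source}
        added = 0
        for variant in paraphrase(source):
            if added >= max_paraphrases:
                break
            if variant in seen:
                continue  # skip duplicates and the unchanged sentence
            seen.add(variant)
            expanded.append((variant, target))
            added += 1
    return expanded


def toy_paraphrase(sentence: str) -> List[str]:
    """Hypothetical stand-in for parse-then-generate paraphrasing."""
    table = {
        "I ate an apple quickly .": [
            "I quickly ate an apple .",
            "I ate an apple quickly .",  # identical output, filtered out
        ],
    }
    return table.get(sentence, [])


if __name__ == "__main__":
    corpus = [
        ("I ate an apple quickly .", "<translation 1>"),
        ("Hello .", "<translation 2>"),
    ]
    for pair in expand_corpus(corpus, toy_paraphrase):
        print(pair)
```

Sentences for which the paraphraser returns nothing pass through unchanged, so the expanded corpus is never smaller than the original.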
