German Compounds in Factored Statistical Machine Translation

An empirical method for splitting German compounds is explored and varied in several ways to investigate the consequences for factored statistical machine translation between English and German, in both directions. Compound splitting is incorporated into translation as a preprocessing step, performed on the training data and on German translation input. For translation into German, compound parts are merged in a postprocessing step based on their part-of-speech. Compound parts are marked to distinguish them from ordinary words. Translation quality improves in both translation directions, and the number of untranslated words in the English output is reduced. Different versions of the splitting algorithm perform best in the two translation directions.
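The splitting step can be illustrated with a minimal sketch of a frequency-based splitter, in the spirit of corpus-driven empirical splitting: each candidate segmentation is scored by the geometric mean of the corpus frequencies of its parts, and the highest-scoring one is kept. This is not the exact algorithm or any of the variants evaluated here. The function names, the `#` part marker, the minimum part length, the set of linking elements, and the toy counts are all assumptions made for illustration.

```python
from math import prod

MIN_PART_LEN = 3            # assumed minimum length of a compound part
FILLERS = ("", "s", "es")   # assumed linking elements dropped from non-final parts


def candidate_splits(word, freq):
    """Yield segmentations of a lowercased word; non-final parts must be known words."""
    yield [word]  # the unsplit word is always a candidate, and comes first
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        head = word[:i]
        for filler in FILLERS:
            if filler and not head.endswith(filler):
                continue
            stem = head[:len(head) - len(filler)] if filler else head
            if len(stem) < MIN_PART_LEN or freq.get(stem, 0) == 0:
                continue
            for rest in candidate_splits(word[i:], freq):
                yield [stem] + rest


def score(parts, freq):
    """Geometric mean of the part frequencies (0 if any part is unseen)."""
    counts = [freq.get(p, 0) for p in parts]
    return prod(counts) ** (1.0 / len(counts))


def split_compound(word, freq):
    """Return the best-scoring segmentation; ties keep the unsplit word."""
    return max(candidate_splits(word, freq), key=lambda parts: score(parts, freq))


def mark_parts(parts, marker="#"):
    """Mark non-final parts so they stay distinct from ordinary words."""
    return [p + marker for p in parts[:-1]] + parts[-1:]


# Toy counts invented for this example, not taken from any corpus.
TOY_COUNTS = {"aktion": 960, "plan": 710, "aktionsplan": 5,
              "akt": 20, "ion": 2, "haus": 500}

if __name__ == "__main__":
    parts = split_compound("aktionsplan", TOY_COUNTS)
    print(mark_parts(parts))
```

With these toy counts, the split into "aktion" and "plan" outscores both the unsplit word and the spurious three-way split "akt", "ion", "plan", because the geometric mean penalises segmentations that contain a rare part. Marking non-final parts is one simple way to let a postprocessing step find the tokens that are candidates for merging.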
