Hebrew Morphological Preprocessing for Statistical Machine Translation

This paper presents a range of preprocessing solutions for Hebrew-English statistical machine translation. Our best system, which uses a morphological analyzer, improves by 3.5 BLEU points over a no-tokenization baseline on a blind test set. The next best system uses Morfessor, an unsupervised morphological segmenter, and gains almost 3.0 BLEU points over the same baseline.
