Simultaneous Word-Morpheme Alignment for Statistical Machine Translation

Current word alignment models for statistical machine translation do not address morphology beyond merely splitting words. We present a two-level alignment model that distinguishes between words and morphemes by embedding an IBM Model 1 inside an HMM-based word alignment model. The model jointly induces word and morpheme alignments using an EM algorithm. We evaluate our model on Turkish-English parallel data and obtain a significant improvement in BLEU scores over IBM Model 4. Our results indicate that utilizing information from morphology improves the quality of word alignments.
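To make the EM component concrete, the following is a minimal sketch of plain IBM Model 1 training run over morpheme tokens. It is not the two-level model described above: the HMM word-level layer is omitted, and the function name `train_ibm_model1`, the `<NULL>` token, and the toy Turkish-English segmentation are all assumptions made for illustration only.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM for IBM Model 1.

    pairs: list of (source_tokens, target_tokens); tokens may be morphemes.
    Returns a dict-like table t[(f, e)] = P(source token f | target token e).
    """
    src_vocab = {f for src, _ in pairs for f in src}
    # Uniform initialisation of the translation table.
    t = defaultdict(lambda: 1.0 / len(src_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)

        # E-step: distribute each source token over all target positions.
        for src, tgt in pairs:
            tgt_null = ["<NULL>"] + list(tgt)  # hypothetical empty-word token
            for f in src:
                z = sum(t[(f, e)] for e in tgt_null)
                for e in tgt_null:
                    posterior = t[(f, e)] / z
                    count[(f, e)] += posterior
                    total[e] += posterior

        # M-step: renormalise expected counts per target token.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})

    return t


# Toy, hand-segmented data (hypothetical; not taken from the paper's corpus).
toy_pairs = [
    (["ev", "+ler"], ["house", "+s"]),
    (["ev"], ["house"]),
    (["kitap", "+lar"], ["book", "+s"]),
    (["kitap"], ["book"]),
]

table = train_ibm_model1(toy_pairs)
```

On this toy corpus the stems gravitate toward the English stems and the plural suffixes toward `+s`, which is the kind of morpheme-level evidence a joint word-morpheme model could exploit. In the model described above, a component of this kind is embedded inside an HMM-based word alignment model rather than trained in isolation.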
