Word-Order Issues in English-to-Urdu Statistical Machine Translation

Word-Order Issues in English-to-Urdu Statistical Machine Translation We investigate phrase-based statistical machine translation between English and Urdu, two Indo-European languages that differ significantly in their word-order preferences. Reordering of words and phrases is thus a necessary part of the translation process. While local reordering is modeled nicely by phrase-based systems, long-distance reordering is known to be a hard problem. We perform experiments using the Moses SMT system and discuss reordering models available in Moses. We then present our novel, Urdu-aware, yet generalizable approach based on reordering phrases in syntactic parse tree of the source English sentence. Our technique significantly improves quality of English-Urdu translation with Moses, both in terms of BLEU score and of subjective human judgments.

[1]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[2]  Brendan S. Gillon Review of Natural language processing: a Paninian perspective by Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. Prentice-Hall of India 1995. , 1995 .

[3]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[4]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[5]  Daniel Zeman,et al.  English–Hindi Translation in 21 Days , 2008 .

[6]  Sandy Lovie Shannon, Claude E , 2005 .

[7]  Pushpak Bhattacharyya,et al.  Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation , 2008, IJCNLP.

[8]  Bushra Jawaid,et al.  Rule Based English to Urdu Machine Translation , 2007 .

[9]  Daniel Zeman Using TectoMT as a Preprocessing Tool for Phrase-Based Statistical Machine Translation , 2010, TSD.

[10]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[11]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[12]  Tony McEnery,et al.  EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation , 2002, LREC.

[13]  Philip Koehn,et al.  Statistical Machine Translation , 2010, EAMT.

[14]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[15]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[16]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[17]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .