Paper information - Word-Order Issues in English-to-Urdu Statistical Machine Translation

We investigate phrase-based statistical machine translation between English and Urdu, two Indo-European languages that differ significantly in their word-order preferences. Reordering of words and phrases is thus a necessary part of the translation process. While phrase-based systems model local reordering well, long-distance reordering is known to be a hard problem. We perform experiments using the Moses SMT system and discuss the reordering models available in Moses. We then present our novel, Urdu-aware, yet generalizable approach based on reordering phrases in the syntactic parse tree of the source English sentence. Our technique significantly improves the quality of English-Urdu translation with Moses, both in terms of BLEU score and of subjective human judgments.
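The abstract does not spell out the reordering rules, so the following is only a toy illustration of the general idea of source-side reordering on a parse tree, not the authors' actual method. It assumes a single hypothetical rule: within every VP, verbal children (tags starting with `VB`) move to the end, which approximates the verb-final order of Urdu. The tree encoding, the function names `reorder` and `leaves`, and the example sentence are all assumptions made for this sketch.

```python
from typing import List, Tuple, Union

# A node is (label, children); a leaf is (POS tag, word).
# This nested-tuple encoding is an assumption made for the sketch.
Tree = Tuple[str, Union[str, List["Tree"]]]


def reorder(tree: Tree) -> Tree:
    """Recursively move verbal children to the end of each VP (hypothetical rule)."""
    label, children = tree
    if isinstance(children, str):  # leaf: nothing to reorder
        return tree
    kids = [reorder(child) for child in children]
    if label == "VP":
        verbs = [k for k in kids if k[0].startswith("VB")]
        others = [k for k in kids if not k[0].startswith("VB")]
        kids = others + verbs  # verb-final order inside the VP
    return (label, kids)


def leaves(tree: Tree) -> List[str]:
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]


# Toy English parse: "she read the book" (subject-verb-object)
sentence: Tree = ("S", [
    ("NP", [("PRP", "she")]),
    ("VP", [("VBD", "read"), ("NP", [("DT", "the"), ("NN", "book")])]),
])

reordered = " ".join(leaves(reorder(sentence)))
print(reordered)  # she the book read
```

In a pipeline of this kind, the reordered English side would replace the original English text before training and decoding, so that the phrase-based system only has to handle the remaining local reordering.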

Daniel Zeman | Bushra Jawaid
