A Study of Statistical Machine Translation Methods for Under Resourced Languages

Abstract This paper contributes an empirical study of the application of five state-of-the-art statistical machine translation (SMT) methods to the translation of low-resource languages. The methods studied were phrase-based, hierarchical phrase-based, operation sequence model, string-to-tree, and tree-to-string SMT, applied between English (en) and the under-resourced languages Lao (la), Myanmar (mm), and Thai (th) in both directions. Translation quality was measured automatically with BLEU and RIBES in all experiments. Our main finding was that the phrase-based SMT method generally gave the highest BLEU scores. This was counter to expectations, and we believe it indicates that this method may be more robust to limited data set size. However, when evaluated with RIBES, the best scores came from methods other than phrase-based SMT, indicating that those methods handled word reordering better even with limited data. Our study achieved the highest reported results on these data sets for all language pairs.
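The contrast between the BLEU and RIBES rankings can be illustrated with a toy sketch. It is not the evaluation code used in the study: it isolates only the core ingredient of each metric, clipped n-gram precision for BLEU and a normalized Kendall's tau over word positions for RIBES. Full BLEU additionally combines 1- to 4-gram precisions and applies a brevity penalty, and full RIBES multiplies the rank correlation by unigram-precision and brevity-penalty terms. The function names and example sentences are invented for illustration, and the rank function assumes every word is unique.

```python
from collections import Counter
from itertools import combinations


def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision, the building block of BLEU."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


def normalized_kendall_tau(hyp, ref):
    """Kendall's tau rescaled to [0, 1], the rank-correlation idea behind RIBES.

    Simplifying assumption: every word is unique within each sentence.
    """
    ranks = [ref.index(word) for word in hyp if word in ref]
    pairs = list(combinations(range(len(ranks)), 2))
    if not pairs:
        return 0.0
    # A pair is concordant when the hypothesis keeps the reference order.
    concordant = sum(1 for i, j in pairs if ranks[i] < ranks[j])
    return concordant / len(pairs)


# Hypothetical sentences: one local swap versus one long-distance block move.
reference = "he went to the market yesterday".split()
local_swap = "went he to the market yesterday".split()
block_move = "the market yesterday he went to".split()

for name, hyp in [("local swap", local_swap), ("block move", block_move)]:
    print(
        name,
        "bigram precision:", round(ngram_precision(hyp, reference, 2), 3),
        "normalized tau:", round(normalized_kendall_tau(hyp, reference), 3),
    )
```

In this toy case the block move keeps more bigrams intact than the local swap, so the n-gram measure prefers it, while the rank-correlation measure penalizes it heavily for the global word-order error. This is the kind of disagreement that lets the two metrics rank systems differently.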
