Augmenting Performance of SMT Models by Deploying Fine Tokenization of the Text and Part-of-Speech Tag

This paper presents our study of exploiting word class information, augmented with some rule-based processing, for phrase-based Statistical Machine Translation (SMT). In SMT, estimating word-to-word alignment probabilities for the translation model can be difficult because of data sparsity: most words in a given corpus occur at most a handful of times. For a highly inflected language such as Oromo, this problem can be particularly severe. In addition, Oromo text uses several different symbols for ‘hudhaa’ (the diacritical marker), which introduces a further severe data sparsity problem. In this work, we modify the Oromo input using fine tokenization, which takes into account the intra-word behavior of words containing hudhaa, together with POS tags, and we show how this improves an Oromo-English machine translation system. The models were trained on a very small parallel corpus (of a size usually considered unacceptable for a normal SMT system), and the quality of the corpus was poor, with both translation and spelling errors. Nevertheless, our final system achieves a BLEU score of 2.88, compared to 2.56 for the baseline system.
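
To illustrate why variant hudhaa symbols cause sparsity, the following is a minimal sketch, not the system described above. It assumes hudhaa may be typed as any of several apostrophe-like characters (the set `HUDHAA_VARIANTS` is an illustrative guess; the abstract does not list the variants), maps them to one canonical symbol, and keeps a hudhaa that sits between letters inside its word. The names `normalize_hudhaa` and `tokenize` are hypothetical.

```python
import re

# Assumed apostrophe-like characters that may stand in for hudhaa.
# This set is an illustration only; the abstract does not enumerate the variants.
HUDHAA_VARIANTS = "\u2018\u2019\u02bc\u0060\u00b4"
CANONICAL_HUDHAA = "'"

_TRANSLATION = str.maketrans({ch: CANONICAL_HUDHAA for ch in HUDHAA_VARIANTS})

# A word is a run of letters, optionally continued by hudhaa + letters.
# Digits and any other non-space character become separate tokens.
_TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*|\d+|\S")


def normalize_hudhaa(text):
    """Map every assumed hudhaa variant to a single canonical symbol."""
    return text.translate(_TRANSLATION)


def tokenize(text):
    """Split text into tokens, keeping intra-word hudhaa attached to its word."""
    return _TOKEN.findall(normalize_hudhaa(text))


if __name__ == "__main__":
    # Two spellings of the same word collapse to one token type.
    print(tokenize("bal\u2019aa bal`aa"))
```

Without such a step, each spelling of the same word would count as a distinct vocabulary item, which fragments the already small alignment statistics. One caveat of this sketch is that genuine quotation marks are also mapped to the canonical symbol; they still end up as separate tokens because they are not surrounded by letters on both sides.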
