Paper information - Augmenting Performance of SMT Models by Deploying Fine Tokenization of the Text and Part-of-Speech Tag


This paper studies how word class information, augmented with rule-based processing, can be exploited for phrase-based Statistical Machine Translation (SMT). In SMT, estimating word-to-word alignment probabilities for the translation model is difficult because of data sparsity: most words in a given corpus occur at most a handful of times. For a highly inflected language such as Oromo, the problem can be particularly severe. In addition, the Oromo ‘hudhaa’ (the diacritical marker) is written with several different symbols, which introduces a further severe data sparsity problem. In this work, we modify the Oromo input using fine tokenization that takes into account the intra-word behavior of words containing hudhaa, together with POS tags, and show how this improves an Oromo-English machine translation system. The models were trained on a very small parallel corpus (a size usually considered unacceptable for a normal SMT system), and the quality of the corpus was not high in terms of either translation or spelling. Nevertheless, our final system achieves a BLEU score of 2.88, compared to 2.56 for the baseline system.
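The hudhaa handling described in the abstract can be illustrated with a minimal sketch. It is not the authors' rule set: the list of variant symbols, the canonical form (ASCII apostrophe), and the names `normalize_hudhaa` and `tokenize` are all assumptions made for illustration. The idea is to map every assumed variant to one symbol and then tokenize so that a hudhaa inside a word stays attached to that word, which keeps spelling variants of the same word from being counted as different types.

```python
import re

# Assumed characters that may stand in for hudhaa (right/left single quote,
# backtick, acute accent, modifier letter apostrophe). Not taken from the paper.
HUDHAA_VARIANTS = "\u2019\u2018\u0060\u00B4\u02BC"
CANONICAL = "'"  # assumed canonical form: ASCII apostrophe

_VARIANT_RE = re.compile("[" + HUDHAA_VARIANTS + "]")
# A token is either a word that may contain word-internal apostrophes,
# or a single non-space, non-word character (punctuation).
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def normalize_hudhaa(text):
    """Replace every assumed hudhaa variant with the canonical apostrophe."""
    return _VARIANT_RE.sub(CANONICAL, text)


def tokenize(text):
    """Normalize hudhaa, then split into words and punctuation marks."""
    return _TOKEN_RE.findall(normalize_hudhaa(text))


if __name__ == "__main__":
    print(tokenize("bu`aa fi har\u00B4a."))
```

One limitation of this sketch is that genuine quotation marks written with the same characters are also mapped to an apostrophe; a production rule set would need to tell the two uses apart.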

Degen Huang | Abraham Tesso Nedjo
