Pause-Based Phrase Extraction and Effective OOV Handling for Low-Resource Machine Translation Systems

Machine translation is a core problem in natural language processing research worldwide. However, building a statistical machine translation (SMT) system for low-resource languages remains a challenge. This work proposes a phrase-induced hybrid machine translation system for English-to-Tamil translation in a low-resource setting and studies its effect. Unlike conventional hybrid MT systems, it exploits the free word order of the target language, Tamil, to form a re-ordered target language model and to extend the parallel text corpus used for training the SMT. A novel rule-based phrase-extraction method, implemented using parts of speech (POS) and place of pause in both languages, is proposed and used to pre-process the training corpus for developing the back-off phrase-induced SMT. Further, out-of-vocabulary (OOV) words are handled using speech-based transliteration and two-level thesaurus intersection techniques, based on the POS tag of the OOV word. To ensure that input containing OOV words does not skip phrase-level translation in the hierarchical model, a phrase-level example-based machine translation approach is adopted to find the closest matching phrase and perform translation, followed by OOV replacement. The proposed system achieves a bilingual evaluation understudy (BLEU) score of 84.78 and a translation edit rate (TER) of 19.12. Its performance is also compared, in terms of adequacy and fluency, with existing translation systems for this language pair, and the proposed system is observed to outperform its counterparts.
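For readers unfamiliar with the translation edit rate (TER) reported above, the following is a minimal sketch of a simplified variant: the word-level edit distance between a hypothesis and a single reference, divided by the reference length and expressed as a percentage. It is an illustration only, not the evaluation code used in this work. Full TER also counts block shifts as edits, which this sketch omits, and the function names `word_edit_distance` and `simplified_ter` are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edits per reference word, as a percentage.

    Assumption: whitespace tokenization and a single reference.
    Block shifts, which full TER counts as one edit, are not modeled.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # One missing word against a six-word reference.
    print(simplified_ter("the cat sat on mat", "the cat sat on the mat"))
```

Lower values indicate fewer edits are needed to turn the system output into the reference, so a lower TER is better, whereas a higher BLEU is better.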
