Orthographic Syllable as basic unit for SMT between Related Languages

We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages that use abugida or alphabetic scripts. We show that translation at the orthographic-syllable level significantly outperforms models trained over other basic units (word, morpheme, and character) when training over small parallel corpora.
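
To make the unit concrete, the following is a minimal sketch of segmenting a word into consonant-vowel sequences. It is an illustration under stated assumptions, not the segmentation procedure used in the paper: the vowel inventory (`aeiou`), the function name `orthographic_syllables`, and the choice to emit word-final consonants as their own unit are all hypothetical, and the sketch handles only lowercase Latin-script input, not abugida scripts.

```python
import re

# Assumption: a toy vowel inventory for Latin-script text.
VOWELS = "aeiou"

# A unit is zero or more non-vowel characters followed by exactly one vowel.
# Assumption: leftover consonants with no following vowel form their own unit.
_UNIT = re.compile(rf"[^{VOWELS}]*[{VOWELS}]|[^{VOWELS}]+")


def orthographic_syllables(word: str) -> list[str]:
    """Split a single word into variable-length consonant-vowel units."""
    return _UNIT.findall(word.lower())


if __name__ == "__main__":
    units = orthographic_syllables("translation")
    print(units)  # ['tra', 'nsla', 'ti', 'o', 'n']
```

The units vary in length and concatenate back to the original word, so a translation system operating on them can restore word boundaries with a separate boundary marker, in the same way character-level systems do.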
