Data representation methods and use of mined corpora for Indian language transliteration

Our NEWS 2015 shared task submission is a phrase-based statistical machine translation (PBSMT) transliteration system with the following corpus preprocessing enhancements: (i) addition of word-boundary markers, and (ii) language-independent, overlapping character segmentation. We show that the addition of word-boundary markers improves transliteration accuracy substantially, whereas our overlapping segmentation shows promise in our preliminary analysis. We also compare transliteration systems trained on manually created corpora with systems trained on corpora mined from parallel translation corpora for English-to-Indian-language pairs. We identify the major errors in English-to-Indian-language transliteration by analyzing heat maps of confusion matrices.
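
The two preprocessing steps can be illustrated with a minimal sketch. The abstract does not specify the marker symbols, the segment length, or the amount of overlap, so the `^` and `$` markers, the function names, and the `size` and `overlap` parameters below are assumptions chosen for illustration, not the configuration used in the submission.

```python
def add_boundary_markers(word, start="^", end="$"):
    """Split a word into characters and wrap it in boundary tokens.

    The marker symbols are hypothetical; any symbols that do not occur
    in the corpus would serve the same purpose.
    """
    return [start] + list(word) + [end]


def overlapping_segments(chars, size=2, overlap=1):
    """Group a character sequence into fixed-size, overlapping segments.

    size and overlap are illustrative assumptions. Consecutive segments
    share `overlap` characters; the last segment may be shorter.
    """
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    segments = []
    for i in range(0, len(chars), step):
        segments.append("".join(chars[i:i + size]))
        if i + size >= len(chars):
            break  # the end of the sequence is already covered
    return segments


if __name__ == "__main__":
    # One token per line element, as a PBSMT toolkit would expect
    # space-separated tokens on each side of the training corpus.
    print(" ".join(add_boundary_markers("delhi")))
    print(" ".join(overlapping_segments(list("delhi"))))
```

Each function returns a list of tokens that would be joined with spaces to form one side of a training pair. With boundary markers, the PBSMT system can learn mappings that are specific to the start or end of a word, which is one plausible reason for the accuracy gain reported above.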
