Paper Information - Word-Level Language Identification and Back Transliteration of Romanized Text

Word-Level Language Identification and Back Transliteration of Romanized Text

This paper presents the BMSCE team's participation in the "FIRE Shared Task on Transliterated Search subtask-1". Our language identification system is based on the n-gram approach and uses a trigram language identifier, trained on a shared and collected training set, to classify the language of each word. We use a rule-based approach blended with a simple dictionary search to back-transliterate Romanized Kannada words. We participated in the Bengali-English, Gujarati-English, Kannada-English, Malayalam-English and Tamil-English language tracks and obtained 70-80% accuracy for the language pairs.
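To make the trigram idea concrete, the following is a minimal sketch of a word-level language identifier built on character trigrams. It is not the authors' system: the class name `TrigramLanguageIdentifier`, the boundary markers `^` and `$`, the add-one smoothing, and the tiny training word lists are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the character trigrams of a word, with assumed boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramLanguageIdentifier:
    """Hypothetical word-level classifier over per-language trigram counts."""

    def __init__(self):
        self.counts = {}    # language -> Counter of trigram frequencies
        self.vocab = set()  # all trigrams seen in any language

    def train(self, language, words):
        counter = self.counts.setdefault(language, Counter())
        for word in words:
            grams = trigrams(word)
            counter.update(grams)
            self.vocab.update(grams)

    def score(self, language, word):
        """Sum of add-one-smoothed log probabilities of the word's trigrams."""
        counter = self.counts[language]
        total = sum(counter.values())
        vocab_size = len(self.vocab)
        return sum(
            math.log((counter[g] + 1) / (total + vocab_size))
            for g in trigrams(word)
        )

    def identify(self, word):
        """Return the language whose trigram profile scores the word highest."""
        return max(self.counts, key=lambda lang: self.score(lang, word))


# Toy training data (assumed for illustration, not the shared-task data).
identifier = TrigramLanguageIdentifier()
identifier.train("English", ["the", "this", "that", "there", "search",
                             "song", "movie", "where", "what", "with"])
identifier.train("Kannada", ["naanu", "neenu", "mane", "hesaru", "haadu",
                             "yenu", "beku", "illa", "namma", "nimma"])

print(identifier.identify("then"))   # English
print(identifier.identify("nanna"))  # Kannada
```

With realistic training data, each language's profile would be built from thousands of words, and words labelled as Kannada would then be passed on to the back-transliteration step.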

Royal Denzil Sequiera | Shashank S. Rao | B. R. Shambavi

[1] W. B. Cavnar et al. N-gram-based text categorization, 1994.

[2] P. Nather. N-Gram based Text Categorization, 2005.

[3] Carol Myers-Scotton et al. Duelling Languages: Grammatical Structure in Codeswitching, 1993.

[4] Rishiraj Saha Roy et al. Overview and Datasets of FIRE 2013 Track on Transliterated Search, 2013.

[5] Prasenjit Majumder et al. Overview of the FIRE 2013 Track on Transliterated Search, 2013, FIRE.