Word-Level Language Identification and Back Transliteration of Romanized Text

This paper presents the BMSCE team's participation in the `FIRE Shared Task on Transliterated Search subtask-1'. Our language identification system is based on the n-gram approach: a trigram language identifier, trained on a combination of shared and collected training data, classifies the language of each word. We use a rule-based approach blended with a simple dictionary search to back-transliterate Romanized Kannada words. We participated in the Bengali-English, Gujarati-English, Kannada-English, Malayalam-English and Tamil-English language tracks and obtained 70-80% accuracy for the language pairs.
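
As an illustration of word-level language identification with trigrams, the following is a minimal sketch and not the system described here. It assumes character trigrams with boundary padding, add-one smoothing, and a tiny hand-made word list; the class name `TrigramLanguageIdentifier`, the labels `en` and `kn`, and the training words are all hypothetical.

```python
import math
from collections import Counter


def char_trigrams(word):
    """Return the character trigrams of a word, padded with boundary markers."""
    padded = f"^^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramLanguageIdentifier:
    """Toy word-level language identifier (assumed design, add-one smoothing)."""

    def __init__(self):
        self.counts = {}    # language -> Counter of trigram frequencies
        self.totals = {}    # language -> total number of trigrams seen
        self.vocab = set()  # all trigrams seen in any language

    def train(self, language, words):
        counter = self.counts.setdefault(language, Counter())
        for word in words:
            counter.update(char_trigrams(word))
        self.totals[language] = sum(counter.values())
        self.vocab.update(counter)

    def score(self, language, word):
        # Sum of smoothed log-probabilities of the word's trigrams.
        counter = self.counts[language]
        denom = self.totals[language] + len(self.vocab)
        return sum(math.log((counter[t] + 1) / denom) for t in char_trigrams(word))

    def identify(self, word):
        # Pick the language whose trigram model scores the word highest.
        return max(self.counts, key=lambda lang: self.score(lang, word))


# Hypothetical toy training data; a real system needs far larger word lists.
identifier = TrigramLanguageIdentifier()
identifier.train("en", ["the", "and", "search", "song", "language", "this", "that"])
identifier.train("kn", ["mattu", "hesaru", "naanu", "neenu", "illa", "beku", "namma"])

print(identifier.identify("thing"))
print(identifier.identify("nanna"))
```

Scoring each word independently keeps the sketch simple; it ignores the context of neighbouring words, which a fuller system might exploit.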