Word-Level Language Identification and Back Transliteration of Romanized Text
This paper presents the BMSCE team's participation in Subtask 1 of the FIRE Shared Task on Transliterated Search. Our language identification system follows the n-gram approach: a trigram language identifier, trained on the shared training set and on additional collected data, classifies the language of each word. We use a rule-based approach combined with a simple dictionary search to back-transliterate Romanized Kannada words. We participated in the Bengali-English, Gujarati-English, Kannada-English, Malayalam-English and Tamil-English language tracks and obtained 70-80% accuracy across these language pairs.
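To make the idea of a word-level trigram language identifier concrete, here is a minimal sketch in Python. It is not the BMSCE system: the scoring rule (add-one smoothed character-trigram log-probabilities), the boundary markers `^` and `$`, the class name `TrigramLanguageIdentifier`, and the tiny word lists are all assumptions made for illustration. The abstract does not say how trigrams are scored or which data were used.

```python
import math
from collections import Counter


def char_trigrams(word):
    """Return the character trigrams of a word, padded with boundary markers."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramLanguageIdentifier:
    """Toy word-level language identifier (illustrative, not the paper's model)."""

    def __init__(self):
        self.counts = {}    # language -> Counter of trigram frequencies
        self.totals = {}    # language -> total number of trigrams seen
        self.vocab = set()  # all trigrams seen in any language

    def train(self, labelled_words):
        """Accumulate trigram counts from (word, language) pairs."""
        for word, lang in labelled_words:
            grams = char_trigrams(word)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}

    def score(self, word, lang):
        """Sum of add-one smoothed log-probabilities of the word's trigrams."""
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen trigrams
        counts, total = self.counts[lang], self.totals[lang]
        return sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_trigrams(word)
        )

    def classify(self, word):
        """Return the language whose trigram model scores the word highest."""
        return max(self.counts, key=lambda lang: self.score(word, lang))


# Hypothetical toy training data; a real system would use the shared-task set.
TRAINING_WORDS = [
    ("the", "en"), ("this", "en"), ("search", "en"), ("song", "en"),
    ("lyrics", "en"), ("where", "en"), ("there", "en"), ("thing", "en"),
    ("namma", "kn"), ("nanna", "kn"), ("haadu", "kn"), ("mane", "kn"),
    ("hegide", "kn"), ("illa", "kn"), ("beku", "kn"), ("banni", "kn"),
]

identifier = TrigramLanguageIdentifier()
identifier.train(TRAINING_WORDS)
labels = [(w, identifier.classify(w)) for w in ["then", "nammane"]]
```

Each word in a mixed query is labelled independently, which is what makes the identification word-level. A rank-order profile comparison of n-grams could be substituted for the smoothed log-probability score without changing the surrounding structure.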