This paper proposes a simple solution for language identification and named entity (NE) recognition at the word level, as part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q1: w1 w2 w3 ... wn in Roman script, the task calls for labeling each word of the query as English (En) or as a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper combines a dictionary lookup with a Naïve Bayes classifier trained over character n-grams. We also devise an algorithm to resolve ambiguities between languages for any given word in a query. Our system achieved F-measure scores of 85-90% for four languages and 74-80% for four others.
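To make the combination of dictionary lookup and a character n-gram Naïve Bayes classifier concrete, the following is a minimal sketch, not the system described here. The n-gram range (1 to 3), the add-one smoothing, the boundary markers, the toy dictionaries, and the rule of falling back to the classifier whenever the dictionaries give zero or several matches are all illustrative assumptions; the ambiguity-resolution algorithm mentioned above is not reproduced.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a lowercased word padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (assumed design)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                   # additive smoothing constant
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.totals = Counter()              # label -> total n-gram count
        self.docs = Counter()                # label -> number of training words
        self.vocab = set()                   # all n-grams seen in training

    def fit(self, labelled_words):
        for word, label in labelled_words:
            grams = char_ngrams(word)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, word):
        n_docs = sum(self.docs.values())
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.docs):
            # log prior plus sum of smoothed log likelihoods
            score = math.log(self.docs[label] / n_docs)
            for gram in char_ngrams(word):
                num = self.counts[label][gram] + self.alpha
                den = self.totals[label] + self.alpha * v
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def label_query(query, dictionaries, classifier):
    """Label each word: a unique dictionary hit wins, otherwise classify."""
    labels = []
    for word in query.split():
        hits = [lang for lang, words in dictionaries.items()
                if word.lower() in words]
        if len(hits) == 1:
            labels.append(hits[0])
        else:
            # Assumption: missing or ambiguous words go to the classifier.
            labels.append(classifier.predict(word))
    return labels


# Toy, hypothetical data for illustration only.
toy_dictionaries = {"En": {"the", "song"}, "Hi": {"gaana", "mera"}}
toy_classifier = NgramNaiveBayes().fit(
    [("the", "En"), ("song", "En"), ("gaana", "Hi"), ("mera", "Hi")])
toy_labels = label_query("mera song sunao", toy_dictionaries, toy_classifier)
```

In this sketch the dictionary acts as a high-precision first pass and the classifier covers out-of-vocabulary spellings, which are common in Romanized text because transliteration is not standardized.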