NELIS - Named Entity and Language Identification System: Shared Task System Description

This paper proposes a simple and elegant solution for language identification and named entity (NE) recognition at a word level, as a part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q1:w1 w2 w3 ... wn in Roman script, the task calls for labeling words of the query as English (En) or a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper uses the combination of a dictionary lookup with a Naïve Bayes classifier trained over character n-grams. Also, we devise an algorithm to resolve ambiguities between languages, for any given word in a query. Our system achieved impressive f-measure scores of 85-90% in four languages and 74-80% in another four languages.