Paper Information - NELIS - Named Entity and Language Identification System: Shared Task System Description

This paper proposes a simple solution for word-level language identification and named entity (NE) recognition, developed for Subtask-1: Query Word Labeling of FIRE 2015. Given any query q1: w1 w2 w3 ... wn in Roman script, the task calls for labeling each word of the query as English (En) or as a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper combines a dictionary lookup with a Naïve Bayes classifier trained over character n-grams. We also devise an algorithm to resolve ambiguities between languages for any given word in a query. Our system achieved F-measure scores of 85-90% in four languages and 74-80% in another four languages.
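The combination of dictionary lookup and a character n-gram Naïve Bayes classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the function and class names, the n-gram range (1 to 3), the add-one smoothing, and the rule "use the dictionary only when exactly one language matches, otherwise fall back to the classifier" are all assumptions made for illustration. The abstract does not describe the paper's own ambiguity-resolution algorithm, so it is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.word_counts = Counter()              # label -> training words
        self.vocab = set()                        # all n-grams seen

    def fit(self, labeled_words):
        for word, label in labeled_words:
            grams = char_ngrams(word)
            self.ngram_counts[label].update(grams)
            self.word_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, word):
        total_words = sum(self.word_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.word_counts):
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.word_counts[label] / total_words)
            for gram in char_ngrams(word):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def label_query(query, dictionaries, classifier):
    """Label each word: dictionary first, classifier as fallback.

    Assumption: a dictionary hit is trusted only when exactly one
    language's word list contains the word.
    """
    labels = []
    for word in query.split():
        hits = [lang for lang, words in dictionaries.items()
                if word.lower() in words]
        if len(hits) == 1:
            labels.append((word, hits[0]))
        else:
            labels.append((word, classifier.predict(word)))
    return labels


# Toy, hypothetical data purely for demonstration.
dictionaries = {"En": {"the", "song"}, "Hi": {"gaana", "mera"}}
training = [("music", "En"), ("lyrics", "En"), ("pyaar", "Hi"), ("dil", "Hi")]
classifier = NgramNaiveBayes().fit(training)
print(label_query("mera song", dictionaries, classifier))
```

In this sketch the dictionary handles words it knows unambiguously, and the classifier covers out-of-vocabulary words and words found in more than one word list. A real system would need far larger word lists and training data than the toy values shown.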

Gowri Srinivasa | Sampath Shanmugam | Rampreeth Ethiraj | Navneet Sinha
