Named Entity Recognition for Mongolian Language

This paper presents pioneering work on building a Named Entity Recognition (NER) system for Mongolian, a language with agglutinative morphology and subject-object-verb word order. We search a wide range of features for the best-performing feature set, and we refine the machine learning approach by matching tokens against gazetteers with approximate string matching, in an effort to handle out-of-vocabulary words robustly. We also apply various existing machine learning methods and use a genetic algorithm to find an optimal ensemble of classifiers, where the classifiers use different feature representations. The resulting system constitutes the first usable software package for Mongolian NER, and our experimental evaluation will serve as a much-needed basis of comparison for further research.
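To illustrate why approximate gazetteer matching helps with an agglutinative language, here is a minimal sketch of a gazetteer lookup that tolerates suffixed word forms. The abstract does not say which string metric or threshold the system uses, so the normalized Levenshtein similarity, the 0.7 threshold, the function names, and the two-entry gazetteer are all assumptions made for illustration only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def gazetteer_feature(token, gazetteer, threshold=0.7):
    """Return the label of the closest gazetteer entry, or None.

    `gazetteer` maps entry strings to entity labels (hypothetical format).
    `threshold` is an assumed cut-off, not a value taken from the paper.
    """
    best_label, best_score = None, threshold
    for entry, label in gazetteer.items():
        score = similarity(token.lower(), entry.lower())
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy gazetteer of location names.
gazetteer = {"Улаанбаатар": "LOC", "Монгол": "LOC"}

# An inflected form (base name plus a case suffix) is absent from the
# gazetteer as an exact string, but it is close enough to match.
print(gazetteer_feature("Улаанбаатарт", gazetteer))
# An unrelated common noun falls below the threshold.
print(gazetteer_feature("ном", gazetteer))
```

In a full system the returned label would typically be passed to the classifier as one feature among many rather than used as a final decision, so that a spurious near-match can be overridden by context.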
