A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms

In this paper we introduce a multilingual Named Entity Recognition (NER) system that uses statistical modeling techniques. The system identifies and classifies named entities (NEs) in Hungarian and English by applying AdaBoostM1 and the C4.5 decision tree learning algorithm. We focused on building as large a feature set as possible and used a split-and-recombine technique to fully exploit its potential. This methodology allowed us to train several independent decision tree classifiers on different subsets of features and to combine their decisions in a majority voting scheme. The corpus made for the CoNLL 2003 conference and a segment of the Szeged Corpus were used for training and validation; both consist entirely of newswire articles. Our system remains portable across languages without requiring any major modification. It slightly outperforms the best system of CoNLL 2003 and achieves a 94.77% F measure for Hungarian. The real value of our approach lies in its different basis compared to other top-performing models for English, which makes our system extremely successful when used in combination with CoNLL models.
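
The split-and-recombine idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the round-robin split rule, the feature names, the tie-breaking fallback to the "O" (outside) label, and the hard-coded per-classifier predictions are all assumptions made for illustration. In the system described, each feature subset would train its own boosted C4.5 classifier, and only the voting step is shown concretely here.

```python
from collections import Counter


def split_features(feature_names, n_subsets):
    """Deal feature names round-robin into n_subsets groups.

    Hypothetical split rule; the abstract does not say how subsets are chosen.
    """
    return [feature_names[i::n_subsets] for i in range(n_subsets)]


def majority_vote(per_classifier_labels, fallback="O"):
    """Combine label sequences token by token.

    Each inner list holds one classifier's labels for the same tokens.
    A tie between the two most frequent labels yields `fallback`
    (an assumed tie-breaking rule).
    """
    combined = []
    for token_votes in zip(*per_classifier_labels):
        counts = Counter(token_votes).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            combined.append(fallback)
        else:
            combined.append(counts[0][0])
    return combined


# Illustrative feature names (assumptions, not taken from the paper).
features = ["capitalized", "suffix", "pos_tag", "gazetteer", "word_length"]
subsets = split_features(features, 3)

# Stand-in predictions from three classifiers, one per feature subset,
# for a three-token sentence.
votes = [
    ["B-PER", "O", "B-LOC"],
    ["B-PER", "O", "O"],
    ["B-ORG", "O", "B-LOC"],
]
combined = majority_vote(votes)
```

Because each classifier sees a different view of the token, their errors are less likely to coincide, which is what makes a simple vote worthwhile.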
