Combining Named Entity Recognition Methods for Concept Extraction in Microposts

Named entity recognition (NER) in microposts is a key and challenging task in mining semantics from social media. Our evaluation of a number of popular NE recognizers over a micropost dataset has shown a significant drop-off in result quality: current state-of-the-art NER methods perform much better on formal text than on microposts. However, the experiment provided us with an interesting observation: although individual NER tools did not perform very well on micropost data, we obtained recall over 90% when we merged the results of all the examined tools. This means that if we were able to combine different NE recognizers in a meaningful way, we might be able to achieve NER in microposts of acceptable quality. In this paper, we propose a method for NER in microposts that combines annotations yielded by existing NER tools in order to produce more precise results than the input tools alone. We combine NE recognizers using machine learning (ML) techniques, namely decision tree and random forest using the C4.5 algorithm. The main advantages of the proposed method are that it can combine arbitrary NER methods and that it is applicable to short, informal texts. The evaluation on a standard dataset shows that the proposed approach outperforms the underlying NER methods as well as a baseline recognizer, which is a simple combination of the best underlying recognizers for each target NE class. To the best of our knowledge, the proposed approach achieves the highest F1 score reported to date on the #MSM2013 dataset.
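The observation about merged results can be illustrated with a minimal sketch. It is not the method proposed in the paper: the tool names (`tool_a`, `tool_b`, `tool_c`), the character offsets, the entity classes, and the exact-match criterion are all assumptions made up for this toy example. It shows only how taking the union of several recognizers' annotations can raise recall while also admitting conflicting annotations, which is why a learned combiner is needed to restore precision.

```python
from typing import Dict, Set, Tuple

# An annotation is (start_offset, end_offset, entity_class).
# All values below are invented for illustration only.
Annotation = Tuple[int, int, str]


def recall(predicted: Set[Annotation], gold: Set[Annotation]) -> float:
    """Fraction of gold annotations found in the predictions (exact match)."""
    if not gold:
        return 0.0
    return len(predicted & gold) / len(gold)


def precision(predicted: Set[Annotation], gold: Set[Annotation]) -> float:
    """Fraction of predicted annotations that are in the gold standard."""
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


def merge(outputs: Dict[str, Set[Annotation]]) -> Set[Annotation]:
    """Union of all annotations produced by the individual recognizers."""
    merged: Set[Annotation] = set()
    for annotations in outputs.values():
        merged |= annotations
    return merged


# Hypothetical gold standard and recognizer outputs for one micropost.
gold = {(0, 5, "PER"), (10, 16, "LOC"), (20, 27, "ORG"), (30, 34, "MISC")}
outputs = {
    "tool_a": {(0, 5, "PER"), (10, 16, "LOC")},
    "tool_b": {(20, 27, "ORG"), (10, 16, "ORG")},  # second span has the wrong class
    "tool_c": {(30, 34, "MISC"), (0, 5, "PER")},
}

per_tool_recall = {name: recall(anns, gold) for name, anns in outputs.items()}
merged = merge(outputs)
merged_recall = recall(merged, gold)
merged_precision = precision(merged, gold)

print(per_tool_recall)
print(merged_recall, merged_precision)
```

In this toy data no single tool finds more than half of the gold annotations, while the union finds all of them. The union also keeps the conflicting `(10, 16, "ORG")` annotation, so its precision falls below 1; deciding which of the overlapping annotations to keep is the part that a trained classifier would have to handle.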
