TwitterNEED: A hybrid approach for named entity extraction and disambiguation for tweet

Twitter is a rich source of continuously and instantly updated information. The shortness and informality of tweets pose challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates, using features derived from the disambiguation phase in addition to word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain, for each extracted mention, a list of entity candidates from the YAGO Knowledge Base together with the top-ranked pages from the Google search engine. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. Five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results than the competing systems DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
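
The extract-then-filter control flow described above can be illustrated with a minimal sketch. It is not the TwitterNEED implementation: the toy knowledge base, the three features, the hand-picked weights and the threshold are all assumptions made for illustration, and a simple linear score stands in for the trained Support Vector Machine.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: surface form -> candidate entity pages.
# In the paper this role is played by YAGO plus web search results.
TOY_KB: Dict[str, List[str]] = {
    "paris": ["Paris", "Paris_Hilton"],
    "paris hilton": ["Paris_Hilton"],
    "hilton": ["Hilton_Hotels", "Paris_Hilton"],
}


def candidate_mentions(tweet: str, max_len: int = 3) -> List[str]:
    """Recall-oriented phase: every word n-gram up to max_len is a candidate."""
    words = tweet.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def features(mention: str) -> Tuple[float, float, float]:
    """Illustrative word-shape, KB and disambiguation-derived features."""
    capitalised = float(all(w[:1].isupper() for w in mention.split()))
    pages = TOY_KB.get(mention.lower(), [])
    in_kb = float(bool(pages))
    # Disambiguation clue: fewer candidate pages means a less ambiguous mention.
    clarity = 1.0 / len(pages) if pages else 0.0
    return capitalised, in_kb, clarity


# Hand-picked values standing in for a trained SVM decision function.
WEIGHTS = (1.0, 1.0, 1.0)
THRESHOLD = 2.0


def keep(mention: str) -> bool:
    """Precision-oriented phase: drop candidates that score too low."""
    score = sum(w * f for w, f in zip(WEIGHTS, features(mention)))
    return score > THRESHOLD


def extract(tweet: str) -> List[str]:
    """Generate candidates generously, then filter out likely false positives."""
    return [m for m in candidate_mentions(tweet) if keep(m)]


if __name__ == "__main__":
    print(extract("I saw Paris Hilton today"))
```

The sketch keeps overlapping mentions; resolving them, and ranking the candidate pages for each surviving mention, are left out.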
