Uncertainty Handling in Named Entity Extraction and Disambiguation for Informal Text

Social media content represents a large portion of all textual content appearing on the Internet. These streams of user-generated content (UGC) present both an opportunity and a challenge for media analysts: huge amounts of new data can be analyzed and used to infer and reason with new information. A main challenge of natural language is its ambiguity and vagueness. To resolve ambiguity automatically, the grammatical structure of sentences is used. However, the informal language widely used in social media is more ambiguous and therefore more challenging for automatic understanding. Information Extraction (IE) is the research field that enables the use of unstructured text in a structured way. Named Entity Extraction (NEE) is a subtask of IE that aims to locate phrases (mentions) in the text that represent names of entities such as persons, organizations, or locations, regardless of their type. Named Entity Disambiguation (NED) is the task of determining which person, place, event, etc. a mention refers to. The goal of this paper is to provide an overview of some approaches that mimic the human way of recognizing and disambiguating named entities, especially for domains that lack formal sentence structure. The proposed methods open the door to more sophisticated applications based on users' contributions on social media. We propose a robust combined framework for NEE and NED in semi-formal and informal text. The achieved robustness has been proven to hold across languages and domains and to be independent of the selected extraction and disambiguation techniques. It is also shown to be robust against the informality of the language used. We have discovered a reinforcement effect and exploited it in a technique that improves extraction quality by feeding back disambiguation results. We also present a method for handling the uncertainty involved in extraction to improve the disambiguation results.
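As a rough illustration of the reinforcement idea described above, the following minimal sketch keeps an extraction confidence for each candidate mention instead of a hard yes/no decision, and then adjusts that confidence using the outcome of disambiguation. It is not the authors' implementation: the toy knowledge base, the capitalization heuristic, the function names, and the `boost`, `penalty`, and `threshold` values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: lower-cased surface form -> candidate entities.
TOY_KB = {
    "paris": ["Paris_(France)", "Paris_Hilton"],
    "twente": ["University_of_Twente"],
}


@dataclass
class Mention:
    text: str
    confidence: float  # extraction confidence in [0, 1], kept instead of a hard decision


def extract_candidates(tokens):
    """High-recall extraction: every token is a candidate mention.

    Capitalized tokens get a higher prior (an assumed heuristic), which is
    unreliable in informal text where capitalization is often missing.
    """
    return [Mention(t, 0.6 if t[:1].isupper() else 0.3) for t in tokens]


def disambiguate(mention):
    """Return the candidate entities for a mention from the toy knowledge base."""
    return TOY_KB.get(mention.text.lower(), [])


def reinforce(mentions, boost=0.3, penalty=0.2, threshold=0.5):
    """Feed disambiguation results back into the extraction confidence.

    A candidate that links to at least one entity is boosted, one that links
    to nothing is penalized; only candidates at or above the threshold survive.
    """
    kept = []
    for m in mentions:
        entities = disambiguate(m)
        score = m.confidence + boost if entities else m.confidence - penalty
        score = min(1.0, max(0.0, score))
        if score >= threshold:
            kept.append((Mention(m.text, score), entities))
    return kept


if __name__ == "__main__":
    tokens = "went to paris with Lol yesterday".split()
    for mention, entities in reinforce(extract_candidates(tokens)):
        print(mention.text, round(mention.confidence, 2), entities)
```

In this toy run the lower-cased "paris" starts with a low extraction confidence but survives because disambiguation finds candidate entities for it, while the capitalized "Lol" is dropped because nothing in the knowledge base supports it. A real system would replace the lookup table and the fixed adjustments with an actual disambiguation component and learned scores.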
