Quick-and-clean extraction of linked data entities from microblogs

In this paper, we address the problem of finding named entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, and similar or redundant tweets, while retaining information content. We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare the runtime and the number of entities extracted from the full and the downscaled versions of a micropost set. We demonstrate that for datasets of more than 10 million tweets we can reduce the size by more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools. We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
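The three discarding steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the retweet pattern, the capitalisation/hashtag heuristic standing in for "pre-identifiable entities", the Jaccard similarity measure, and the `threshold` value are all assumptions chosen for illustration.

```python
import re

# Assumption: a retweet is recognised by the conventional "RT @user" prefix.
RETWEET = re.compile(r"^RT\s+@\w+")
# Assumption: a crude proxy for "pre-identifiable entities" is any hashtag,
# mention, or capitalised word (this also fires on sentence-initial capitals).
CANDIDATE = re.compile(r"[#@]\w+|\b[A-Z][a-z]+\b")
TOKEN = re.compile(r"\w+")


def tokens(text):
    """Lower-cased set of word tokens, used for the similarity check."""
    return frozenset(TOKEN.findall(text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def downscale(tweets, threshold=0.8):
    """Return a reduced sample of tweets (hypothetical helper).

    Drops retweets, tweets with no entity candidate, and tweets whose
    token set is too similar to one that was already kept.
    """
    kept, seen = [], []
    for text in tweets:
        text = text.strip()
        if RETWEET.match(text):
            continue  # step 1: discard retweets
        if not CANDIDATE.search(text):
            continue  # step 2: discard tweets without entity candidates
        toks = tokens(text)
        if any(jaccard(toks, s) >= threshold for s in seen):
            continue  # step 3: discard similar or redundant tweets
        seen.append(toks)
        kept.append(text)
    return kept


sample = downscale([
    "RT @bob: Obama visits Berlin today",
    "Obama visits Berlin today",
    "Obama visits Berlin today!!",
    "so tired, going to sleep now",
    "New #iPhone announced by Apple",
])
```

The pairwise comparison in step 3 is quadratic in the number of kept tweets, so a corpus at the scale described above would need an approximate technique (for example, hashing-based near-duplicate detection) instead; the sketch only shows the order of the filters.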
