Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

The manual curation of knowledge bases is a bottleneck in fast paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show how our method significantly outperforms a lexical-matching baseline, by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.

[1]  Razvan C. Bunescu,et al.  Using Encyclopedic Knowledge for Named entity Disambiguation , 2006, EACL.

[2]  Ian H. Witten,et al.  Learning to link with wikipedia , 2008, CIKM '08.

[3]  Joel Nothman,et al.  Learning multilingual named entity recognition from Wikipedia , 2013, Artif. Intell..

[4]  Kalina Bontcheva,et al.  TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text , 2013, RANLP.

[5]  M. de Rijke,et al.  Adding semantics to microblog posts , 2012, WSDM '12.

[6]  Ee-Peng Lim,et al.  Comments-oriented document summarization: understanding documents with readers' feedback , 2008, SIGIR '08.

[7]  Heng Ji,et al.  Analysis and Enhancement of Wikification for Microblogs with Context Expansion , 2012, COLING.

[8]  Maarten Marx,et al.  Two-Stage Named-Entity Recognition Using Averaged Perceptrons , 2012, NLDB.

[9]  Oren Etzioni,et al.  No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities , 2012, EMNLP.

[10]  Kalina Bontcheva,et al.  Microblog-genre noise and impact on semantic annotation accuracy , 2013, HT.

[11]  Rada Mihalcea,et al.  Wikify!: linking documents to encyclopedic knowledge , 2007, CIKM '07.

[12]  M. de Rijke,et al.  Incorporating Query Expansion and Quality Indicators in Searching Microblog Posts , 2011, ECIR.

[13]  Beatrice Alex,et al.  Optimising Selective Sampling for Bootstrapping Named Entity Recognition , 2005, ICML 2005.

[14]  Peter Ingwersen,et al.  Developing a Test Collection for the Evaluation of Integrated Search , 2010, ECIR.

[15]  Ming-Wei Chang,et al.  To Link or Not to Link? A Study on End-to-End Tweet Entity Linking , 2013, NAACL.

[16]  M. de Rijke,et al.  Credibility-inspired ranking for blog post retrieval , 2012, Information Retrieval.

[17]  Maguelonne Teisseire,et al.  Natural Language Processing and Information Systems , 2014, Lecture Notes in Computer Science.

[18]  Oren Etzioni,et al.  Open domain event extraction from twitter , 2012, KDD.

[19]  Lan Nie,et al.  Resolving Surface Forms to Wikipedia Topics , 2010, COLING.

[20]  J. R. Firth,et al.  A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[21]  Andrew Y. Ng,et al.  Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines , 2006, EMNLP.

[22]  Aba-Sah Dadzie,et al.  Making Sense of Microposts (#MSM2013) Concept Extraction Challenge , 2013, #MSM.

[23]  Valentin Jijkoun,et al.  Named entity normalization in user generated content , 2008, AND '08.

[24]  Nan Ye,et al.  Domain adaptive bootstrapping for named entity recognition , 2009, EMNLP.

[25]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[26]  Wagner Meira,et al.  Named Entity Disambiguation in Streaming Data , 2012, ACL.

[27]  Maarten de Rijke,et al.  Feeding the Second Screen: Semantic Linking based on Subtitles , 2013, DIR.

[28]  Mitchell P. Marcus,et al.  Text Chunking using Transformation-Based Learning , 1995, VLC@ACL.

[29]  Iadh Ounis,et al.  Overview of the TREC 2011 Microblog Track , 2011, TREC.

[30]  Zornitsa Kozareva Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists , 2006, EACL.