Document Filtering for Long-tail Entities

Filtering documents by their relevance to an entity is an essential task in knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity and selecting only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they are trained on, and rely on, features that are specific to each individual entity. Moreover, these approaches tend to use so-called extrinsic information, such as Wikipedia page views and related entities, which is typically available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities. In this paper we propose a document filtering method for long-tail entities that is entity-independent and thus also generalizes to unseen or rarely seen entities. It is based on intrinsic features, i.e., features that are derived from the documents in which the entities are mentioned. We propose a set of features that capture informativeness, entity-saliency, and timeliness. In particular, we introduce features based on entity aspect similarities, relation patterns, and temporal expressions, and combine these with standard features for document filtering. Experiments following the TREC KBA 2014 setup on a publicly available dataset show that our model improves filtering performance for long-tail entities over several baselines. Results of applying the model to unseen entities are promising, indicating that the model is able to learn the general characteristics of a vital document. The overall performance across all entities (i.e., not just long-tail entities) improves upon the state of the art without depending on any entity-specific training data.
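
To make the notion of intrinsic, entity-independent features concrete, the following is a minimal sketch in Python. It is not the feature set used in the paper: the function name `intrinsic_features`, the feature names, the bag-of-words cosine used as a stand-in for aspect similarity, and the year-matching regular expression used as a stand-in for temporal expression handling are all assumptions made for illustration. The point it illustrates is only that every value is computed from the document text, the document timestamp, and the entity name, without any entity-specific training signal or extrinsic data such as page views.

```python
import math
import re
from collections import Counter
from datetime import datetime

# Hypothetical, simplified temporal-expression detector: four-digit years only.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intrinsic_features(doc_text, doc_time, entity_name, aspects):
    """Compute illustrative document-derived features for one (document, entity) pair.

    `aspects` is an assumed list of short aspect phrases; how aspects are
    mined is outside the scope of this sketch.
    """
    lowered = doc_text.lower()
    name = entity_name.lower()
    doc_vec = Counter(tokenize(doc_text))

    # Entity-saliency proxies: how often and how early the entity is mentioned.
    mention_count = lowered.count(name) if name else 0
    first = lowered.find(name) if name else -1
    first_mention_pos = first / len(lowered) if first >= 0 and lowered else 1.0

    # Informativeness proxy: best match between the document and any aspect phrase.
    max_aspect_sim = max(
        (cosine(doc_vec, Counter(tokenize(a))) for a in aspects), default=0.0
    )

    # Timeliness proxy: distance between the document year and the nearest
    # year mentioned in the text (-1 if no year is mentioned).
    years = [int(m.group(0)) for m in YEAR_PATTERN.finditer(doc_text)]
    nearest_year_gap = min((abs(doc_time.year - y) for y in years), default=-1)

    return {
        "mention_count": mention_count,
        "first_mention_pos": first_mention_pos,
        "max_aspect_sim": max_aspect_sim,
        "nearest_year_gap": nearest_year_gap,
    }
```

Because none of these values depends on the identity of the entity, a single classifier trained on such feature vectors can in principle be applied to entities that never appeared in the training data, which is the property the abstract refers to as entity-independence.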
