Terms over LOAD: Leveraging Named Entities for Cross-Document Extraction and Summarization of Events

Real-world events, such as historic incidents, typically have both spatial and temporal aspects and involve a specific group of persons. This is reflected in the descriptions of events in textual sources, which contain mentions of named entities and dates. In a large collection of documents, however, the description of an event may be incomplete in any single document or spread across multiple documents. In these cases, it is beneficial to leverage partial information about the entities involved in an event to extract the missing information. In this paper, we introduce the LOAD model for cross-document event extraction in large-scale document collections. The graph-based model relies on co-occurrences of named entities belonging to the classes locations, organizations, actors, and dates, and puts them in the context of surrounding terms. As such, the model allows for efficient queries and can be updated incrementally in negligible time to reflect changes to the underlying document collection. We discuss the versatility of this approach for event summarization, the completion of partial event information, and the extraction of descriptions for named entities and dates. We create and provide a LOAD graph for the documents in the English Wikipedia from named entities extracted by state-of-the-art NER tools. We evaluate the resulting graph on a set of historic data that includes summaries of diverse events. We find that the model not only allows for near real-time retrieval of information from the underlying document collection, but also provides a comprehensive framework for browsing and summarizing event data.
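To make the idea of an entity co-occurrence graph concrete, the following is a minimal Python sketch, not the implementation used for the Wikipedia graph. It assumes that documents arrive already annotated as sentences of (label, type) pairs, that an edge weight is a plain count of sentence-level co-occurrences, and that terms are linked only to entities and not to each other. The class name CooccurrenceGraph, the method names, and the type codes LOC, ORG, ACT, DAT, and TER are all hypothetical choices made for this illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical type codes: locations, organizations, actors, dates, plain terms.
ENTITY_TYPES = {"LOC", "ORG", "ACT", "DAT"}
TERM_TYPE = "TER"


class CooccurrenceGraph:
    """Undirected weighted graph over (label, type) nodes."""

    def __init__(self):
        # weights[a][b] = number of sentences in which a and b co-occur
        self.weights = defaultdict(lambda: defaultdict(int))

    def add_document(self, sentences):
        """Incrementally add one document, given as a list of sentences.

        Each sentence is a list of (label, type) pairs. Only the edges
        touched by this document are updated (assumption: simple counts).
        """
        for sentence in sentences:
            nodes = sorted(set(sentence))  # count each node once per sentence
            for a, b in combinations(nodes, 2):
                if a[1] == TERM_TYPE and b[1] == TERM_TYPE:
                    continue  # assumption: no term-term edges
                self.weights[a][b] += 1
                self.weights[b][a] += 1

    def neighbours(self, node, node_type=None, k=5):
        """Return up to k neighbours of node, strongest first.

        If node_type is given, only neighbours of that type are returned,
        e.g. the dates most strongly associated with an actor.
        """
        ranked = sorted(
            self.weights.get(node, {}).items(),
            key=lambda item: (-item[1], item[0]),
        )
        return [
            (n, w) for n, w in ranked
            if node_type is None or n[1] == node_type
        ][:k]


# Usage: two small hand-annotated documents about the same event.
graph = CooccurrenceGraph()
graph.add_document([
    [("Neil Armstrong", "ACT"), ("Moon", "LOC"),
     ("1969-07-20", "DAT"), ("landed", "TER"), ("first", "TER")],
])
graph.add_document([
    [("Neil Armstrong", "ACT"), ("NASA", "ORG"), ("astronaut", "TER")],
    [("NASA", "ORG"), ("Moon", "LOC"), ("1969-07-20", "DAT")],
])

# Complete partial event information: which dates relate to this location?
dates_for_moon = graph.neighbours(("Moon", "LOC"), node_type="DAT")
```

Because add_document touches only the edges of the new document, the sketch mirrors the incremental-update property described above in spirit; a query such as neighbours restricted to one entity class corresponds to completing a partially known event from the entities that are already given.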
