Growing Story Forest Online from Massive Breaking News

We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has requirements that differ from those of previous studies on topic detection and tracking (TDT) and on event timeline or graph generation: we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically cluster streaming documents into events while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation on 60 GB of real-world Chinese news data, complemented by detailed pilot user experience studies; our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate that Story Forest identifies events more accurately than multiple existing algorithm frameworks and organizes news text into a logical structure that is more appealing to human readers.
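The two requirements above (grouping streaming documents into events, and growing story trees without restructuring them) can be illustrated with a minimal sketch. It is not the Story Forest algorithm itself: the keyword-set representation, the Jaccard similarity, the two thresholds, and the names `Event` and `StoryForestSketch` are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two keyword sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass(eq=False)  # compare events by identity; nodes reference each other
class Event:
    keywords: set
    docs: list = field(default_factory=list)
    parent: "Event | None" = None
    children: list = field(default_factory=list)


class StoryForestSketch:
    """Hypothetical online organizer: documents -> events -> story trees."""

    def __init__(self, event_threshold: float = 0.5, story_threshold: float = 0.2):
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.event_threshold = event_threshold
        self.story_threshold = story_threshold
        self.events: list = []
        self.roots: list = []

    def add_document(self, doc_id: str, keywords) -> Event:
        keywords = set(keywords)
        # Find the existing event whose keyword set is most similar.
        best = max(self.events, key=lambda e: jaccard(e.keywords, keywords),
                   default=None)
        score = jaccard(best.keywords, keywords) if best is not None else 0.0

        if best is not None and score >= self.event_threshold:
            # Redundant coverage of a known event: merge into it.
            best.docs.append(doc_id)
            best.keywords |= keywords
            return best

        # Otherwise this is a new event. Existing nodes are never moved,
        # so previously formed stories keep their shape.
        event = Event(keywords=keywords, docs=[doc_id])
        if best is not None and score >= self.story_threshold:
            event.parent = best          # related: extend an existing story
            best.children.append(event)
        else:
            self.roots.append(event)     # unrelated: start a new story tree
        self.events.append(event)
        return event
```

In this sketch a new document only ever appends to an event, adds a child node, or starts a new tree, so the layout a reader has already seen stays stable as news arrives.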
