Adaptive online event detection in news streams

Event detection aims to discover news documents that report on the same event and group them together. With the explosive growth of online news, event detection is needed to help users navigate news spaces. Existing work usually represents documents with the TF-IDF scheme and applies a clustering algorithm for event detection. However, the traditional TF-IDF vector representation suffers from high dimensionality and sparse semantics. In addition, as more news documents arrive, IDF needs to be incrementally updated. In this paper, we present a novel document representation method based on word embeddings, which reduces the dimensionality and alleviates the sparse semantics compared to TF-IDF, and thus improves efficiency and accuracy. Based on this document representation, we propose an adaptive online clustering method for online news event detection, which improves precision and recall by using time slicing and event merging, respectively. The resulting events are further refined by an adaptive post-processing step that automatically detects noisy events and processes them further. Experiments on standard and real-world datasets show that our proposed adaptive online event detection method significantly improves event detection in both efficiency and accuracy compared to state-of-the-art methods.
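
To make the general idea concrete, the following is a minimal sketch of single-pass online clustering over embedding-based document vectors. It is not the method of the paper, whose details the abstract does not give. Everything here is an assumption made for illustration: the toy two-dimensional word vectors, the averaging of word vectors into a document vector, the cosine threshold, the `window` parameter used as a simple stand-in for time slicing, and the hypothetical names `embed`, `cosine` and `OnlineEventDetector`. Event merging and noisy-event post-processing are not shown.

```python
import math

# Toy 2-dimensional word vectors (hypothetical stand-ins for trained embeddings).
TOY_VECTORS = {
    "earthquake": [1.0, 0.0],
    "quake": [0.9, 0.1],
    "rescue": [0.8, 0.2],
    "election": [0.0, 1.0],
    "vote": [0.1, 0.9],
    "ballot": [0.2, 0.8],
}


def embed(tokens, vectors=TOY_VECTORS):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineEventDetector:
    """Single-pass clustering: each document joins its closest recent event
    or starts a new one. Assumed design, for illustration only."""

    def __init__(self, threshold=0.8, window=2):
        self.threshold = threshold  # minimum cosine similarity to join an event
        self.window = window        # max gap in time slices for an event to stay open
        self.events = []            # dicts with "centroid", "docs", "last_slice"

    def add(self, doc_id, tokens, time_slice):
        """Assign a document to an event and return the event index."""
        vec = embed(tokens)
        if vec is None:
            return None
        best, best_sim = None, self.threshold
        for idx, event in enumerate(self.events):
            # Skip events that have been inactive for longer than the window.
            if time_slice - event["last_slice"] > self.window:
                continue
            sim = cosine(vec, event["centroid"])
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            self.events.append(
                {"centroid": vec, "docs": [doc_id], "last_slice": time_slice}
            )
            return len(self.events) - 1
        event = self.events[best]
        n = len(event["docs"])
        # Update the centroid as the running mean of member document vectors.
        event["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(event["centroid"], vec)]
        event["docs"].append(doc_id)
        event["last_slice"] = time_slice
        return best


detector = OnlineEventDetector(threshold=0.8, window=2)
first = detector.add("d1", ["earthquake", "rescue"], time_slice=0)
second = detector.add("d2", ["quake"], time_slice=0)
third = detector.add("d3", ["election", "vote"], time_slice=1)
late = detector.add("d4", ["earthquake"], time_slice=5)
```

In this sketch the document vectors have a fixed, small dimension and no corpus-wide statistic such as IDF has to be maintained as documents arrive, which is the kind of benefit the abstract attributes to embedding-based representations. Restricting comparisons to recently active events keeps an old event from absorbing a later, unrelated story on a similar topic, so `d4` opens a new event even though it resembles `d1`.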
