Analyzing feature trajectories for event detection

We consider the problem of analyzing word trajectories in both the time and frequency domains, with the specific goal of identifying words that are important or less-reported, and periodic or aperiodic. Words with identical trends can be grouped together to reconstruct an event in a completely unsupervised manner. The document frequency of each word across time is treated as a time series, where each element is the document frequency-inverse document frequency (DFIDF) score at one time point. In this paper, we 1) applied spectral analysis to categorize features by event characteristics: important or less-reported, periodic or aperiodic; 2) modeled aperiodic features with a Gaussian density and periodic features with Gaussian mixture densities, and then detected each feature's burst with a truncated Gaussian approach; 3) proposed an unsupervised greedy event detection algorithm that detects both aperiodic and periodic events. All of these methods can be applied to time series data in general. We extensively evaluated our methods on the 1-year Reuters News Corpus [3] and showed that they were able to uncover meaningful aperiodic and periodic events.
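As a rough illustration of the first step, the sketch below builds a DFIDF time series for one word and reads off its dominant period from a periodogram. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give the DFIDF formula, so the form `(DF(t) / N(t)) * log(N / DF)` is assumed here, and the rule "periodic if the dominant period is at most half the series length" is likewise an assumed threshold. All function names are hypothetical.

```python
import cmath
import math


def dfidf_series(df_per_day, docs_per_day):
    """DFIDF score per time point.

    Assumed form (not stated in the abstract):
    (DF(t) / N(t)) * log(N / DF), with N and DF summed over the whole period.
    """
    total_docs = sum(docs_per_day)
    total_df = sum(df_per_day)
    if total_df == 0:
        return [0.0] * len(df_per_day)
    idf = math.log(total_docs / total_df)
    return [(df / n) * idf if n else 0.0
            for df, n in zip(df_per_day, docs_per_day)]


def periodogram(series):
    """Squared DFT magnitude (scaled by 1/T) at frequencies k/T, k = 1..T//2."""
    length = len(series)
    powers = []
    for k in range(1, length // 2 + 1):
        coeff = sum(y * cmath.exp(-2j * math.pi * k * t / length)
                    for t, y in enumerate(series))
        powers.append(abs(coeff) ** 2 / length)
    return powers


def dominant_period(series):
    """Return (period, power) of the strongest non-zero frequency."""
    powers = periodogram(series)
    best = max(range(len(powers)), key=powers.__getitem__)
    k = best + 1  # powers[0] corresponds to k = 1
    return len(series) / k, powers[best]


def is_periodic(series):
    """Assumed rule: periodic if the dominant period is at most T/2."""
    period, _ = dominant_period(series)
    return period <= len(series) / 2


# A word that peaks once a week over four weeks, and a word with a single burst.
weekly_df = [18, 15, 8, 3, 3, 8, 15] * 4
burst_df = [0] * 12 + [2, 6, 9, 6, 2] + [0] * 11
docs = [100] * 28

weekly = dfidf_series(weekly_df, docs)
burst = dfidf_series(burst_df, docs)
print(dominant_period(weekly)[0], is_periodic(weekly))
print(dominant_period(burst)[0], is_periodic(burst))
```

The dominant power could then serve as a proxy for how heavily a word is reported, which is one way to separate important from less-reported features; the cut-off for that split is not given in the abstract and is left open here.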

[1] James Allan, et al. Topic Detection and Tracking, 2002, The Information Retrieval Series.

[2] James Allan, et al. Text classification and named entities for new event detection, 2004, SIGIR '04.

[3] Thorsten Brants, et al. A System for new event detection, 2003, SIGIR.

[4] Philip S. Yu, et al. Parameter Free Bursty Events Detection in Text Streams, 2005, VLDB.

[5] Hector Garcia-Molina, et al. Overview of multidatabase transaction management, 2005, The VLDB Journal.

[6] Joe Carthy, et al. Combining semantic and syntactic document classifiers to improve first story detection, 2001, SIGIR '01.

[7] James Allan, et al. Topic detection and tracking: event-based information organization, 2002.

[8] Dimitrios Gunopulos, et al. Identifying similarities, periodicities and bursts for online search queries, 2004, SIGMOD '04.

[9] James Allan, et al. First story detection in TDT is hard, 2000, CIKM '00.

[10] Jon M. Kleinberg, et al. Bursty and Hierarchical Structure in Streams, 2002, Data Mining and Knowledge Discovery.

[11] James Allan, et al. Automatic generation of overview timelines, 2000, SIGIR '00.

[12] ChengXiang Zhai, et al. Discovering evolutionary theme patterns from text: an exploration of temporal text mining, 2005, KDD '05.

[13] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[14] James Allan, et al. Retrieval and novelty detection at the sentence level, 2003, SIGIR.

[15] Yiming Yang, et al. Topic-conditioned novelty detection, 2002, KDD.

[16] Qi He, et al. A Model for Anticipatory Event Detection, 2006, ER.

[17] Ravi Kumar, et al. On the Bursty Evolution of Blogspace, 2003, WWW '03.

[18] Qi He, et al. Bursty Feature Representation for Clustering Text Streams, 2007, SDM.

[19] Yiming Yang, et al. A study of retrospective and on-line event detection, 1998, SIGIR '98.