Extracting hot spots of basic and complex topics from time stamped documents

Identifying time periods with a burst of activity related to a topic is an important problem in the analysis of time-stamped documents. In this paper, we discuss methods for computing the hot spot of a given topic in a time-stamped document set. We consider basic topics, which consist of one or more keywords, and complex topics, which combine topics using the logical operators AND, OR, and NOT. We use the temporal scan statistic to assign a discrepancy score to each interval of the time period spanned by the document set; the hot spot of a topic is the interval with the highest discrepancy score. We describe efficient algorithms for computing the hot spots of both basic and complex topics. Preliminary experiments on the SIGMOD/VLDB paper titles data set and on the CNN/Reuters news article titles data set collected from the TDT-Pilot Corpus show that our methods for computing the measure and the hot spot of a topic work well in practice.
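
To make the scoring step concrete, the following is a minimal brute-force sketch of a hot spot computation, not the efficient algorithm the paper describes. It assumes a Kulldorff-style likelihood-ratio discrepancy, which the abstract does not spell out, and it assumes the documents have already been binned into discrete time units. The names `discrepancy`, `hot_spot`, `topic_counts`, and `total_counts` are hypothetical and chosen only for illustration.

```python
import math
from itertools import accumulate


def discrepancy(m, b):
    """Assumed Kulldorff-style discrepancy of one interval.

    m: fraction of all topic documents that fall inside the interval
    b: fraction of all documents that fall inside the interval
    Only intervals where the topic is over-represented (m > b) score above 0.
    """
    if m <= b or b <= 0.0 or b >= 1.0:
        return 0.0
    score = m * math.log(m / b)
    if m < 1.0:
        score += (1.0 - m) * math.log((1.0 - m) / (1.0 - b))
    return score


def hot_spot(topic_counts, total_counts):
    """Return ((start, end), score) for the highest-scoring interval.

    topic_counts[t]: documents matching the topic in time unit t
    total_counts[t]: all documents in time unit t
    Indices are inclusive. All O(n^2) intervals are scanned.
    """
    n = len(topic_counts)
    topic_total = sum(topic_counts)
    doc_total = sum(total_counts)
    if topic_total == 0 or doc_total == 0:
        return None, 0.0

    # Prefix sums give the count inside any interval in O(1).
    topic_prefix = [0] + list(accumulate(topic_counts))
    doc_prefix = [0] + list(accumulate(total_counts))

    best_interval, best_score = None, 0.0
    for i in range(n):
        for j in range(i, n):
            m = (topic_prefix[j + 1] - topic_prefix[i]) / topic_total
            b = (doc_prefix[j + 1] - doc_prefix[i]) / doc_total
            score = discrepancy(m, b)
            if score > best_score:
                best_interval, best_score = (i, j), score
    return best_interval, best_score


# Made-up counts for six time units; the topic is concentrated in units 2 and 3.
interval, score = hot_spot([0, 1, 5, 6, 0, 1], [10, 10, 10, 10, 10, 10])
print(interval, round(score, 3))
```

Under these assumptions, a complex topic could be handled by the same function once `topic_counts` holds, for each time unit, the number of documents that satisfy the Boolean expression.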
