Using Burstiness to Improve Clustering of Topics in News Streams

Specialists who analyze online news have a hard time separating the wheat from the chaff. Moreover, automatic data-mining techniques such as clustering news streams into topical groups can fully recover the true underlying class labels if and only if all classes are well separated. In reality, and especially for news streams, this is clearly not the case. The question to ask is therefore: if we cannot recover all C classes by clustering, what is the largest number K < C of clusters we can find that best resemble K of the underlying classes? Using the intuition that bursty topics are more likely to correspond to important events of interest to analysts, we propose several new bursty vector space models (B-VSM) for representing a news document. B-VSM takes into account the burstiness (measured across the full corpus and its whole duration) of each constituent word in a document at the time of publication. We benchmarked B-VSM against the classical TFIDF-VSM on the task of clustering a collection of news stream articles with known topic labels. Experimental results show that B-VSM was able to find the burstiest clusters/topics. Further, it significantly improved recall and precision for the top K clusters/topics.
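To make the idea of a burst-aware document vector concrete, the following is a minimal sketch and not the B-VSM formulation itself, which the abstract does not spell out. It assumes a hypothetical burst score: the share of a day's documents containing a word divided by the word's share over the whole corpus, kept only when it reaches an assumed `threshold`. Each classical TF-IDF weight is then scaled by one plus that score. The function name, the daily binning, the threshold value, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def bursty_vectors(docs, threshold=2.0):
    """Return one {word: weight} dict per document.

    docs is a list of (day, tokens) pairs. The burst score used here is a
    hypothetical stand-in, not the model proposed in the paper.
    """
    n_docs = len(docs)
    docs_per_day = Counter(day for day, _ in docs)

    # Document frequency of each word, over the corpus and per day.
    df = Counter()
    df_by_day = defaultdict(Counter)
    for day, tokens in docs:
        for word in set(tokens):
            df[word] += 1
            df_by_day[day][word] += 1

    def burst(word, day):
        # Assumed score: daily document share relative to corpus-wide share,
        # zeroed out when it falls below the threshold.
        daily_rate = df_by_day[day][word] / docs_per_day[day]
        overall_rate = df[word] / n_docs
        ratio = daily_rate / overall_rate
        return ratio if ratio >= threshold else 0.0

    vectors = []
    for day, tokens in docs:
        tf = Counter(tokens)
        vec = {}
        for word, count in tf.items():
            idf = math.log(n_docs / df[word])
            # Classical TF-IDF weight, boosted when the word is bursty
            # on the document's publication day.
            vec[word] = count * idf * (1.0 + burst(word, day))
        vectors.append(vec)
    return vectors


# Toy stream: "quake" appears only on day 2, "market" on every day.
docs = [
    (1, ["market", "stocks"]),
    (1, ["market", "election"]),
    (2, ["quake", "rescue", "market"]),
    (2, ["quake", "aid"]),
    (3, ["market", "election"]),
    (3, ["stocks", "aid"]),
]
vectors = bursty_vectors(docs)
```

In this toy stream, "quake" is concentrated on day 2 and so receives a boosted weight in the day-2 documents, while "market" occurs throughout and keeps its plain TF-IDF weight. The resulting vectors could be passed to any standard clustering routine in place of TF-IDF vectors.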
