Clustering Text Data Streams

Clustering text data streams is an important problem in the data mining community, with applications such as newsgroup filtering, text crawling, document organization, and topic detection and tracking. However, most existing methods are similarity-based and represent the semantics of text data using only the TF*IDF scheme, which often leads to poor clustering quality. Recent work argues that a semantic smoothing model is more effective than the TF*IDF scheme at improving text clustering quality, but the existing semantic smoothing model is not suited to dynamic text data. In this paper, we first extend the semantic smoothing model to the text data stream setting. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for clustering massive text data streams. Both algorithms use a new cluster statistics structure, the cluster profile, which captures the semantics of text data streams dynamically while speeding up the clustering process. We also give efficient implementations of both algorithms. Finally, we present a series of experimental results that illustrate the effectiveness of our technique.
