Tracking clusters in evolving data streams over sliding windows

Mining data streams poses great challenges due to the limited memory availability and real-time query response requirement. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behaviors of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with the temporal cluster features, propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism employed to adaptively maintain the in-cluster synopsis can track the cluster evolution better, while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing the cluster evolution and tracking a specific cluster efficiently without interfering with other clusters, thus reducing the consumption of computing resources for data stream clustering. Both the theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.

[1]  Graham Cormode,et al.  What's hot and what's not: tracking most frequent items dynamically , 2003, TODS.

[2]  Sudipto Guha,et al.  Streaming-data algorithms for high-quality clustering , 2002, Proceedings 18th International Conference on Data Engineering.

[3]  Philip S. Yu,et al.  Moment: maintaining closed frequent itemsets over a stream sliding window , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[4]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[5]  Sudipto Guha,et al.  Clustering data streams , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[6]  Geoff Hulten,et al.  A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering , 2001, ICML.

[7]  Eamonn J. Keogh,et al.  Clustering of time-series subsequences is meaningless: implications for previous and future research , 2004, Knowledge and Information Systems.

[8]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[9]  Philip S. Yu,et al.  On High Dimensional Projected Clustering of Data Streams , 2005, Data Mining and Knowledge Discovery.

[10]  Eyke Hüllermeier,et al.  Online clustering of parallel data streams , 2006, Data Knowl. Eng..

[11]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[12]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[13]  Carlos Ordonez,et al.  Clustering binary data streams with K-means , 2003, DMKD '03.

[14]  He Zengyou,et al.  Squeezer: an efficient algorithm for clustering categorical data , 2002 .

[15]  Charles Elkan,et al.  Scalability for clustering algorithms revisited , 2000, SKDD.

[16]  KeoghEamonn,et al.  Clustering of time-series subsequences is meaningless , 2005 .

[17]  Piotr Indyk,et al.  Maintaining stream statistics over sliding windows: (extended abstract) , 2002, SODA '02.

[18]  Philip S. Yu,et al.  A Framework for Projected Clustering of High Dimensional Data Streams , 2004, VLDB.

[19]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[20]  Philip S. Yu,et al.  Catch the moment: maintaining closed frequent itemsets over a data stream sliding window , 2006, Knowledge and Information Systems.

[21]  Sudipto Guha,et al.  Data-streams and histograms , 2001, STOC '01.

[22]  Suh-Yin Lee,et al.  Efficient mining of sequential patterns with time constraints by delimited pattern growth , 2005, Knowledge and Information Systems.

[23]  Ming-Syan Chen,et al.  Adaptive Clustering for Multiple Evolving Streams , 2006, IEEE Transactions on Knowledge and Data Engineering.

[24]  徐晓飞,et al.  Squeezer:An Efficient Algorithm for Clustering Categorical Data , 2002 .

[25]  Jiong Yang Dynamic clustering of evolving streams with a single pass , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[26]  Rajeev Motwani,et al.  Maintaining variance and k-medians over data stream windows , 2003, PODS.

[27]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[28]  Aoying Zhou,et al.  Density-Based Clustering over an Evolving Data Stream with Noise , 2006, SDM.

[29]  Olfa Nasraoui,et al.  Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm , 2003 .

[30]  Aoying Zhou,et al.  Dynamically maintaining frequent items over a data stream , 2003, CIKM '03.

[31]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[32]  Jeffrey Scott Vitter,et al.  Mining deviants in time series data streams , 2004, Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004..

[33]  Rajeev Motwani,et al.  Approximate Frequency Counts over Data Streams , 2012, VLDB.

[34]  Geoff Hulten,et al.  Catching up with the Data: Research Issues in Mining Data Streams , 2001, DMKD.

[35]  Philip S. Yu,et al.  A Framework for Clustering Massive Text and Categorical Data Streams , 2006, SDM.

[36]  Ming-Syan Chen,et al.  Clustering on demand for multiple data streams , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[37]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[38]  Fabio A. González,et al.  TECNO-STREAMS: tracking evolving clusters in noisy data streams with a scalable immune system learning model , 2003, Third IEEE International Conference on Data Mining.

[39]  Piotr Indyk,et al.  Maintaining Stream Statistics over Sliding Windows , 2002, SIAM J. Comput..

[40]  Li Wei,et al.  M-kernel merging: towards density estimation over data streams , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[41]  Rina Panigrahy,et al.  Better streaming algorithms for clustering problems , 2003, STOC '03.

[42]  João Gama,et al.  Hierarchical Time-Series Clustering for Data Streams ⋆ , 2004 .

[43]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[44]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.