Incremental and Adaptive Clustering Stream Data over Sliding Window

Cluster analysis plays a key role in understanding data streams. The problem becomes difficult under the sliding window model, in which outdated data must be eliminated properly. We propose SWEM, an algorithm based on the Expectation Maximization technique, to address these challenges. SWEM computes clusters incrementally using a small number of statistics summarized over the stream, and it adapts to changes in the stream distribution. We verify the feasibility of SWEM through a number of experiments and show that it is superior to the CluStream algorithm on both synthetic and real datasets.
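To make the idea of summarizing a stream with a small number of statistics concrete, the following is a minimal sketch and not the SWEM procedure itself. It assumes one-dimensional data arriving in fixed-size buckets and keeps only additive summaries (count, sum, sum of squares) per bucket, so that an expired bucket can be removed by subtraction. The class name `SlidingWindowStats` and its parameters are hypothetical and chosen for illustration only.

```python
from collections import deque


class SlidingWindowStats:
    """Additive summaries over the most recent `max_buckets` buckets.

    Illustrative assumption: each bucket is reduced to (count, sum,
    sum of squares), which is enough to recover mean and variance.
    """

    def __init__(self, max_buckets):
        self.max_buckets = max_buckets
        self.buckets = deque()  # one (n, s, ss) triple per bucket
        self.n = 0              # number of points in the window
        self.s = 0.0            # sum of points in the window
        self.ss = 0.0           # sum of squared points in the window

    def add_bucket(self, values):
        # Summarize the new bucket and fold it into the window totals.
        n = len(values)
        s = float(sum(values))
        ss = float(sum(v * v for v in values))
        self.buckets.append((n, s, ss))
        self.n += n
        self.s += s
        self.ss += ss
        # Eliminate outdated data: subtract the oldest bucket's summary.
        if len(self.buckets) > self.max_buckets:
            old_n, old_s, old_ss = self.buckets.popleft()
            self.n -= old_n
            self.s -= old_s
            self.ss -= old_ss

    def mean(self):
        if self.n == 0:
            raise ValueError("window is empty")
        return self.s / self.n

    def variance(self):
        # Population variance computed from the summaries alone.
        m = self.mean()
        return self.ss / self.n - m * m


window = SlidingWindowStats(max_buckets=2)
for bucket in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    window.add_bucket(bucket)
# The window now covers only the last two buckets (values 4 through 9).
print(window.n, window.mean())
```

Because the summaries are additive, the raw points never need to be stored. An EM-based method could keep one such summary per mixture component, which is the general direction the abstract describes, although the exact statistics used by SWEM are not specified here.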
