A Framework for Clustering Massive-Domain Data Streams

In this paper, we examine the problem of clustering massive-domain data streams. Massive-domain data streams are those in which the number of possible domain values for each attribute is very large and cannot be easily tracked for clustering purposes. Examples of such streams include IP-address streams, credit-card transaction streams, and streams of sales data over large numbers of items. In such cases, it is well known that even simple stream operations such as counting can be extremely difficult because of the difficulty of maintaining summary information over the different discrete values. Clustering is significantly more challenging in such cases, since the intermediate statistics for the different clusters cannot be maintained efficiently. We propose a method for clustering massive-domain data streams with the use of sketches. We prove probabilistic results which show that a sketch-based clustering method can provide similar results to an infinite-space clustering algorithm with high probability. We present experimental results which validate these theoretical results and show that it is possible to approximate the behavior of an infinite-space algorithm accurately.
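To make the counting difficulty concrete, the following is a minimal sketch of one common sketch structure, a count-min sketch, which approximates per-value counts in fixed space. The abstract does not say which sketch the method uses, so the choice of count-min, the class name `CountMinSketch`, and the `width` and `depth` parameters are all assumptions made for illustration, not the paper's implementation.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch (an assumption, not the paper's code).

    Counts are kept in a depth x width table, so memory use is fixed
    regardless of how many distinct values appear in the stream.
    """

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one cell per row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Hash collisions can only inflate a cell, so the minimum across
        # rows is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


# Usage: count occurrences of IP addresses in a small stream.
sketch = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]:
    sketch.add(ip)
estimate = sketch.estimate("10.0.0.1")  # at least the true count of 3
```

In a clustering setting, one such table could plausibly be kept per cluster to summarize the frequencies of attribute values assigned to it. That pairing is likewise an assumption here, since the abstract gives no such detail.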
