Adaptive Grid-Based k-median Clustering of Streaming Data with Accuracy Guarantee

Data stream clustering has wide applications, such as online financial transactions, telephone records, and network monitoring. Grid-based clustering partitions stream data into cells, derives statistical information about each cell, and then clusters this much smaller statistical summary without referring back to the input data. Grid-based clustering is therefore efficient and well suited to high-throughput data streams, which are continuous, time-varying, and possibly unpredictable. Various grid-based clustering schemes have been proposed. However, to the best of our knowledge, none of them provides an accuracy guarantee for its clustering output. To fill this gap, in this paper we study grid-based k-median clustering. We first derive an accuracy guarantee on the cost difference between the grid-based solution and the optimum. Based on this theoretical analysis, we then propose a general and adaptive solution, which partitions stream data into cells of dynamically determined granularity and runs k-median clustering on the statistical information of the cells with an accuracy guarantee. Extensive experiments on three real datasets show that our solution provides high-quality clustering outputs efficiently.
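The general grid-based idea described above (partition points into cells, keep per-cell statistics, cluster only the statistics) can be illustrated with a minimal sketch. This is not the paper's algorithm: the fixed `cell_width`, the choice of count and mean as the cell statistics, and the brute-force search over cell representatives are all assumptions made for illustration, and the function names are hypothetical. In particular, the paper determines cell granularity dynamically, which this sketch does not attempt.

```python
import math
from collections import defaultdict
from itertools import combinations


def summarize(points, cell_width):
    """Assign each point to a grid cell and keep only per-cell statistics.

    Assumption: a fixed cell width and (count, coordinate sums) as the
    statistics. Returns a list of (mean_point, count) pairs, one per cell.
    """
    cells = defaultdict(lambda: [0, None])  # cell key -> [count, coordinate sums]
    for p in points:
        key = tuple(math.floor(x / cell_width) for x in p)
        entry = cells[key]
        entry[0] += 1
        entry[1] = list(p) if entry[1] is None else [a + b for a, b in zip(entry[1], p)]
    return [(tuple(s / n for s in sums), n) for n, sums in cells.values()]


def weighted_cost(summary, centers):
    """k-median cost on the summary: count-weighted distance to the nearest center."""
    return sum(w * min(math.dist(rep, c) for c in centers) for rep, w in summary)


def k_median_on_cells(summary, k):
    """Pick k centers among the cell representatives by exhaustive search.

    Assumption: brute force stands in for a real k-median solver and is
    only practical for a handful of cells.
    """
    reps = [rep for rep, _ in summary]
    return min(combinations(reps, k), key=lambda cs: weighted_cost(summary, cs))


# Hypothetical toy stream: two groups of 2-D points.
points = [(0.1, 0.2), (0.3, 0.1), (5.2, 5.1), (5.4, 5.3), (5.1, 5.4)]
summary = summarize(points, cell_width=1.0)
centers = k_median_on_cells(summary, k=2)
print(summary)
print(centers, weighted_cost(summary, centers))
```

The clustering step touches only `summary`, never `points`, which is what makes the approach cheap on high-throughput streams. The price is an approximation error that depends on cell size, which is the quantity an accuracy guarantee has to bound.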
