Clustering Data Streams: Theory and Practice

The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. Analyzing such data makes it crucial to process the input in a single pass, or a small number of passes, while using little memory. We describe a streaming algorithm that meets these constraints and effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
