Clustering Geometric Data Streams

Building on recent results in data stream clustering, we present a modified approach to the facility location problem in the context of geometric data streams. We explain the existing algorithm from a less mathematical point of view, focusing on understanding and practical use, particularly by computer graphics experts. We propose a modification of the original data stream k-median clustering algorithm that solves facility location, the case in which the number of clusters in the input data is not known a priori. Like the original, the modified version can process millions of points while using a rather small amount of memory. Based on our experiments with clustering geometric data, we offer suggestions on how to set the processing parameters. We also describe how the algorithm handles various distributions of input data within the stream. These findings may also be applied to the original algorithm.

CR Categories: I.5.3 [Computing Methodologies]: Pattern Recognition—Clustering; I.3.5 [Computing Methodologies]: Computer Graphics—Computational Geometry and Object Modeling
