Small Space Representations for Metric Min-sum k-Clustering and Their Applications

Abstract

The min-sum $k$-clustering problem is to partition a metric space $(P,d)$ into $k$ clusters $C_1,\dots,C_k \subseteq P$ such that $\sum_{i=1}^{k}\sum_{p,q\in C_{i}}d(p,q)$ is minimized. We show the first efficient construction of a coreset for this problem. Our coreset construction is based on a new adaptive sampling algorithm. With our construction of coresets we obtain two main algorithmic results.

The first result is a sublinear-time $(4+\varepsilon)$-approximation algorithm for the min-sum $k$-clustering problem in metric spaces. The running time of this algorithm is $\widetilde{{\mathcal{O}}}(n)$ for any constant $k$ and $\varepsilon$, and it is $o(n^2)$ for all $k=o(\log n/\log\log n)$. Since the full description size of the input is $\Theta(n^2)$, this is sublinear in the input size. The fastest previously known $o(\log n)$-factor approximation algorithm for $k>2$ achieved a running time of $\Omega(n^k)$, and no non-trivial $o(n^2)$-time algorithm was known before.

Our second result is the first pass-efficient data streaming algorithm for min-sum $k$-clustering in the distance oracle model, i.e., an algorithm that uses $\mathrm{poly}(\log n, k)$ space and makes 2 passes over the input point set, which arrives in the form of a data stream in arbitrary order. It computes an implicit representation of a clustering of $(P,d)$ with cost at most a constant factor larger than that of an optimal partition. Using one further pass, we can assign each point to its corresponding cluster.

To develop the coresets, we introduce the concept of $\alpha$-preserving metric embeddings. Such an embedding of $(P,d)$ into $(P,d')$ satisfies two properties: the distance between any pair of points does not decrease, and the cost of an optimal solution for the considered problem on input $(P,d')$ is within a constant factor of the cost of an optimal solution on input $(P,d)$. In other words, the goal is to find a metric embedding into a (structurally simpler) metric space that approximates the original metric up to a factor of $\alpha$ with respect to a given problem. We believe that this concept is an interesting generalization of coresets.
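As a small illustration of the objective defined above, the following Python sketch evaluates the min-sum cost of a given partition and finds an optimal partition of a tiny instance by exhaustive search. It is not the coreset-based algorithm of the paper; the function names `min_sum_cost` and `brute_force_min_sum` are hypothetical, and the sketch assumes that each unordered pair $\{p,q\}$ is counted once (summing over ordered pairs would simply double every cost).

```python
from itertools import combinations, product


def min_sum_cost(clusters, d):
    """Sum of d(p, q) over all unordered pairs inside each cluster.

    Assumption: each pair {p, q} is counted once; the ordered-pair
    convention would double the value returned here.
    """
    return sum(d(p, q) for c in clusters for p, q in combinations(c, 2))


def brute_force_min_sum(points, k, d):
    """Exhaustive search over all k^n label assignments.

    Only meant for tiny inputs, to make the objective concrete.
    Empty clusters are allowed and contribute zero cost.
    """
    best_cost, best_clusters = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        clusters = [
            [p for p, label in zip(points, labels) if label == i]
            for i in range(k)
        ]
        cost = min_sum_cost(clusters, d)
        if cost < best_cost:
            best_cost, best_clusters = cost, clusters
    return best_cost, best_clusters


# Toy metric space: four points on the real line with d(p, q) = |p - q|.
points = [0, 1, 10, 11]


def line_metric(p, q):
    return abs(p - q)


good = min_sum_cost([[0, 1], [10, 11]], line_metric)   # 1 + 1 = 2
bad = min_sum_cost([[0, 10], [1, 11]], line_metric)    # 10 + 10 = 20
best_cost, best_clusters = brute_force_min_sum(points, 2, line_metric)
```

The exhaustive search enumerates $k^n$ assignments, so it is usable only on toy inputs; it serves to show what an approximation algorithm for this objective is being compared against.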
