A k-Median Algorithm with Running Time Independent of Data Size

Abstract

We give a sampling-based algorithm for the k-Median problem, with running time $$O\left(k \left(\frac{k^2}{\epsilon} \log k\right)^2 \log\left(\frac{k}{\epsilon} \log k\right)\right)$$, where k is the desired number of clusters and ε is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has $$\Omega\left(\frac{n\epsilon}{k}\right)$$ points. We also give weakly-polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.
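To make the sampling idea concrete, here is a minimal Python sketch, not the paper's algorithm. It assumes a sample of roughly (k²/ε) log k points, read off the inner term of the running-time bound, with a hypothetical constant `c`. As a stand-in solver it runs a simple single-swap local search on the sample, which is our choice and not taken from the abstract. The names `sample_size`, `cost`, `local_search`, and `k_median_on_sample` are illustrative.

```python
import math
import random


def sample_size(k, eps, c=1.0):
    """Assumed sample size c * (k^2 / eps) * log k; the constant c is hypothetical."""
    return max(k, math.ceil(c * (k * k / eps) * math.log(max(k, 2))))


def cost(points, centers, dist):
    """k-Median objective: sum of distances from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def local_search(sample, k, dist):
    """Stand-in solver (an assumption): single-swap local search over sampled points."""
    candidates = sorted(set(sample))  # points must be hashable and sortable
    if len(candidates) <= k:
        return candidates
    centers = candidates[:k]
    best = cost(sample, centers, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in candidates:
                if cand in centers:
                    continue
                trial = centers[:i] + [cand] + centers[i + 1:]
                trial_cost = cost(sample, trial, dist)
                if trial_cost < best:  # accept only strict improvements
                    centers, best, improved = trial, trial_cost, True
    return centers


def k_median_on_sample(points, k, eps, dist, rng=None, c=1.0):
    """Draw a sample whose size depends only on k and eps, then cluster the sample."""
    rng = rng or random.Random(0)
    s = sample_size(k, eps, c)
    sample = [rng.choice(points) for _ in range(s)]  # uniform, with replacement
    return local_search(sample, k, dist)


if __name__ == "__main__":
    # Two well-separated 1-D clusters of 300 points each.
    data = [float(i % 3) for i in range(300)] + [100.0 + (i % 3) for i in range(300)]
    centers = k_median_on_sample(data, k=2, eps=0.5, dist=lambda a, b: abs(a - b), c=10.0)
    print(sorted(centers))
```

Once the input is in memory, the work is driven by the sample size rather than by n, which is the property the abstract emphasizes. The cluster-size condition matters here: a cluster much smaller than nε/k may receive no sampled points at all, so no center would be placed in it.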
