Linear-time approximation schemes for clustering problems in any dimensions

We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the <i>k</i>-means, <i>k</i>-median, and discrete <i>k</i>-means problems that yield (1+ϵ)-approximations with probability at least 1/2 and run in time <i>O</i>(2<sup>(<i>k</i>/ϵ)<sup><i>O</i>(1)</sup></sup> <i>dn</i>). These are the first algorithms for these problems whose running times are linear in the size of the input (<i>nd</i> for <i>n</i> points in <i>d</i> dimensions) when <i>k</i> and ϵ are fixed. Our method is general enough to apply to clustering problems satisfying certain simple properties and is likely to have further applications.
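For readers unfamiliar with the three objectives named above, the following minimal sketch spells out their standard definitions. It illustrates only the cost functions being approximated, not the algorithm of the paper. The function names (`k_means_cost`, `k_median_cost`, `is_discrete_solution`) are hypothetical and chosen here for illustration.

```python
import math
from typing import Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def k_median_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-median objective: sum of (unsquared) distances to the nearest center."""
    return sum(min(math.sqrt(sq_dist(p, c)) for c in centers) for p in points)


def is_discrete_solution(points: Sequence[Point], centers: Sequence[Point]) -> bool:
    """Discrete k-means uses the k-means cost but requires every center
    to be one of the input points; this checks that restriction."""
    available = {tuple(p) for p in points}
    return all(tuple(c) in available for c in centers)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
    free_centers = [(1.0, 0.0), (11.0, 0.0)]       # arbitrary locations
    discrete_centers = [(0.0, 0.0), (10.0, 0.0)]   # chosen among the inputs
    print(k_means_cost(pts, free_centers))
    print(k_median_cost(pts, free_centers))
    print(is_discrete_solution(pts, discrete_centers))
```

In this vocabulary, a (1+ϵ)-approximation is a set of <i>k</i> centers whose cost is at most (1+ϵ) times the minimum cost over all admissible sets of <i>k</i> centers.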
