A Fast Approximation Scheme for Low-Dimensional k-Means

We consider the popular $k$-means problem in $d$-dimensional Euclidean space. Recently, Friggstad, Rezapour, Salavatipour [FOCS'16] and Cohen-Addad, Klein, Mathieu [FOCS'16] showed that the standard local search algorithm yields a $(1+\epsilon)$-approximation in time $(n \cdot k)^{1/\epsilon^{O(d)}}$, giving the first polynomial-time approximation scheme for the problem in low-dimensional Euclidean space. While local search achieves optimal approximation guarantees, it is not competitive in running time with state-of-the-art heuristics such as the famous $k$-means++ and $D^2$-sampling algorithms. In this paper, we aim to bridge the gap between theory and practice by giving a $(1+\epsilon)$-approximation algorithm for low-dimensional $k$-means running in time $n \cdot k \cdot (\log n)^{(d\epsilon^{-1})^{O(d)}}$, thus matching the running time of the $k$-means++ and $D^2$-sampling heuristics up to polylogarithmic factors. We speed up the local search approach by making a non-standard use of randomized dissections, which allows us to find the best local move efficiently using a quite simple dynamic program. We hope that our techniques can help design better local search heuristics for geometric problems. We note that the doubly exponential dependency on $d$ is necessary, as $k$-means is APX-hard in dimension $d = \omega(\log n)$.
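To make the two ingredients named above concrete, the following is a minimal Python sketch of $D^2$-sampling seeding followed by a naive single-swap local search. It is not the algorithm of this paper: the randomized dissections and the dynamic program that find the best local move are not shown, and each candidate swap is evaluated by brute force. The function names, the restriction of candidate centers to input points, and the improvement threshold `eps` are assumptions made only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def d2_seeding(points, k, rng):
    """Pick k centers; each new one is sampled with probability
    proportional to its squared distance to the centers chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # All points coincide with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def single_swap_local_search(points, centers, eps=0.01):
    """Naive local search (assumption: candidates are the input points).

    Repeatedly replace one center by one input point whenever this lowers
    the cost by more than a (1 - eps) factor; stop at a local optimum.
    Every candidate swap is re-evaluated from scratch, which is exactly
    the cost that a faster scheme would try to avoid.
    """
    centers = list(centers)
    cost = kmeans_cost(points, centers)
    improved = True
    while improved:
        improved = False
        for i in range(len(centers)):
            for cand in points:
                trial = centers[:i] + [cand] + centers[i + 1:]
                trial_cost = kmeans_cost(points, trial)
                if trial_cost < (1 - eps) * cost:
                    centers, cost = trial, trial_cost
                    improved = True
                    break
            if improved:
                break
    return centers, cost
```

A typical use would seed with `d2_seeding(points, k, random.Random(0))` and pass the result to `single_swap_local_search`. Each pass of this brute-force search evaluates up to $n \cdot k$ candidate swaps at a cost of $O(nk)$ distance computations each, which illustrates why plain local search is slow compared with seeding alone.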

[1]  Nabil H. Mustafa,et al.  Improved Results on Geometric Hitting Set Problems , 2010, Discret. Comput. Geom..

[2]  Claire Mathieu,et al.  Effectiveness of Local Search for Geometric Optimization , 2015, SoCG.

[3]  Satish Rao,et al.  A Nearly Linear-Time Approximation Scheme for the Euclidean k-Median Problem , 2007, SIAM J. Comput..

[4]  Pranjal Awasthi,et al.  Improved Spectral-Norm Bounds for Clustering , 2012, APPROX-RANDOM.

[5]  Andreas Krause,et al.  Scalable Training of Mixture Models via Coresets , 2011, NIPS.

[6]  Nabil H. Mustafa,et al.  PTAS for geometric hitting set problems via local search , 2009, SCG '09.

[7]  Sanjeev Arora,et al.  Nearly linear time approximation schemes for Euclidean TSP and other geometric problems , 1997, Proceedings 38th Annual Symposium on Foundations of Computer Science.

[8]  Maria-Florina Balcan,et al.  Clustering under Perturbation Resilience , 2011, SIAM J. Comput..

[9]  Marek Cygan,et al.  Improved Approximation for 3-Dimensional Matching via Bounded Pathwidth Local Search , 2013, 2013 IEEE 54th Annual Symposium on Foundations of Computer Science.

[10]  Shi Li,et al.  Approximating k-median via pseudo-approximation , 2012, STOC '13.

[11]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[12]  Gary R. Weckman,et al.  The discrete Unconscious search and its application to uncapacitated facility location problem , 2014, Comput. Ind. Eng..

[13]  Vincent Cohen-Addad,et al.  On the Local Structure of Stable Clustering Instances , 2017, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[14]  Sencun Zhu,et al.  Towards event source unobservability with minimum network traffic in sensor networks , 2008, WiSec '08.

[15]  Maria-Florina Balcan,et al.  Approximate clustering without the approximation , 2009, SODA.

[16]  Satish Rao,et al.  Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[17]  Sayan Bandyapadhyay,et al.  On Variants of k-means Clustering , 2015, SoCG.

[18]  Mohammad R. Salavatipour,et al.  Local Search Yields a PTAS for k-Means in Doubling Metrics , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[19]  Matt Gibson,et al.  Guarding Terrains via Local Search , 2014, J. Comput. Geom..

[20]  Ola Svensson,et al.  Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms , 2016, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[21]  Pierre Hansen,et al.  Variable neighborhood search: Principles and applications , 1998, Eur. J. Oper. Res..

[22]  Mary Inaba,et al.  Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract) , 1994, SCG '94.

[23]  Michal Pilipczuk,et al.  Optimal Parameterized Algorithms for Planar Facility Location Problems Using Voronoi Diagrams , 2015, ESA.

[24]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[25]  Bodo Manthey,et al.  Smoothed Analysis of the k-Means Method , 2011, JACM.

[26]  Ravishankar Krishnaswamy,et al.  The Hardness of Approximation of Euclidean k-Means , 2015, SoCG.

[27]  Timothy M. Chan,et al.  Approximation Algorithms for Maximum Independent Set of Pseudo-Disks , 2009, Discrete & Computational Geometry.

[28]  Aravind Srinivasan,et al.  An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization , 2014, SODA.

[29]  Sergei Vassilvitskii,et al.  Worst-case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-means Method , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[30]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[31]  Amit Kumar,et al.  Linear-time approximation schemes for clustering problems in any dimensions , 2010, JACM.

[32]  Vladimir Braverman,et al.  New Frameworks for Offline and Streaming Coreset Constructions , 2016, ArXiv.

[33]  Yifeng Zhang,et al.  Tight Analysis of a Multiple-Swap Heuristic for Budgeted Red-Blue Median , 2016, ICALP.

[34]  Amit Kumar,et al.  Clustering with Spectral Norm and the k-Means Algorithm , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[35]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[36]  Maxim Sviridenko,et al.  A Bi-Criteria Approximation Algorithm for k-Means , 2015, APPROX-RANDOM.

[37]  Philip N. Klein,et al.  Local Search Yields Approximation Schemes for k-Means and k-Median in Euclidean and Minor-Free Metrics , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[38]  J. Matoušek.  On Approximate Geometric K-clustering , 1999 .

[39]  Guy E. Blelloch,et al.  Parallel approximation algorithms for facility-location problems , 2010, SPAA '10.

[40]  Meena Mahajan,et al.  The Planar k-means Problem is NP-hard , 2009 .

[41]  Kamesh Munagala,et al.  Local Search Heuristics for k-Median and Facility Location Problems , 2004, SIAM J. Comput..

[42]  Diptesh Ghosh,et al.  Neighborhood search heuristics for the uncapacitated facility location problem , 2003, Eur. J. Oper. Res..

[43]  Konstantin Makarychev,et al.  Algorithms for stable and perturbation-resilient problems , 2017, STOC.

[44]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching in fixed dimensions , 1998, JACM.

[45]  Andreas Krause,et al.  Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures , 2015, AISTATS.

[46]  Andreas Krause,et al.  Coresets for Nonparametric Estimation - the Case of DP-Means , 2015, ICML.

[47]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[48]  Amit Kumar,et al.  A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[49]  Pierre Hansen,et al.  J-MEANS: a new local search heuristic for minimum sum of squares clustering , 1999, Pattern Recognit..

[50]  Yves Crama,et al.  Local Search in Combinatorial Optimization , 2018, Artificial Neural Networks.

[51]  Sergei Vassilvitskii,et al.  How slow is the k-means method? , 2006, SCG '06.

[52]  Inderjit S. Dhillon,et al.  Iterative clustering of high dimensional text data augmented by local search , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[53]  Avrim Blum,et al.  Stability Yields a PTAS for k-Median and k-Means Clustering , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[54]  Sanjoy Dasgupta,et al.  Random projection trees for vector quantization , 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.

[55]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[56]  Aditya Bhaskara,et al.  Distributed Balanced Clustering via Mapping Coresets , 2014, NIPS.

[57]  Pierre Alliez,et al.  Variational shape approximation , 2004, ACM Trans. Graph..

[58]  Laura I. Burke,et al.  A two-phase tabu search approach to the location routing problem , 1999, Eur. J. Oper. Res..

[59]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.