On the Local Structure of Stable Clustering Instances

We study the classic k-median and k-means clustering objectives in the beyond-worst-case scenario. We consider three well-studied notions of structured data that aim at characterizing real-world inputs:

• Distribution Stability (introduced by Awasthi, Blum, and Sheffet, FOCS 2010)
• Spectral Separability (introduced by Kumar and Kannan, FOCS 2010)
• Perturbation Resilience (introduced by Bilu and Linial, ICS 2010)

We prove structural results showing that inputs satisfying at least one of these conditions are inherently local. Namely, for any such input, any local optimum is close to the global optimum, both in terms of structure and in terms of objective value.

As a corollary, we obtain that the widely used Local Search algorithm has strong performance guarantees for both recovering the underlying optimal clustering and obtaining a clustering of small cost. This is a significant step toward understanding the success of local search heuristics in clustering applications.
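To make the notion of a local optimum concrete, here is a minimal sketch of a single-swap local search for the discrete k-median objective. It is an illustration under stated assumptions, not necessarily the exact variant analyzed in the paper: candidate centers are restricted to the input points, only one center is swapped per step, and the names `kmedian_cost`, `local_search_kmedian`, and the tolerance `eps` are hypothetical choices made here.

```python
import itertools


def kmedian_cost(points, centers, dist):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def local_search_kmedian(points, k, dist, eps=1e-9):
    """Single-swap local search (illustrative sketch).

    Assumptions: centers are chosen among the input points, the first k
    points serve as an arbitrary starting solution, and a swap is accepted
    only if it lowers the cost by more than eps.
    """
    centers = list(points[:k])
    cost = kmedian_cost(points, centers, dist)
    improved = True
    while improved:
        improved = False
        # Try replacing the i-th center with every non-center point.
        for i, cand in itertools.product(range(k), points):
            if cand in centers:
                continue
            trial = centers[:i] + [cand] + centers[i + 1:]
            trial_cost = kmedian_cost(points, trial, dist)
            if trial_cost < cost - eps:
                centers, cost = trial, trial_cost
                improved = True
                break  # restart the scan from the improved solution
    # No single swap improves the cost: this is a local optimum.
    return centers, cost
```

On a small well-separated input such as `[0, 1, 2, 100, 101, 102]` with `k = 2` and absolute difference as the distance, the only local optimum under single swaps is also the global one (centers 1 and 101), which is the kind of behavior the structural results describe for stable instances.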
