Learning-Augmented k-means Clustering

k-means clustering is a well-studied problem due to its wide applicability. Unfortunately, there exist strong theoretical limits on the performance of any algorithm for the k-means problem on worst-case inputs. To overcome this barrier, we consider a scenario in which "advice" is provided to help perform clustering. Specifically, we consider the k-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering, up to some possibly adversarial error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though naïvely following an accurate predictor can still lead to a high clustering cost. Thus, if the predictor is sufficiently accurate, we can retrieve a close-to-optimal clustering in nearly optimal running time, breaking known computational barriers for algorithms that do not have access to such advice. We evaluate our algorithms on real datasets and show significant improvements in the quality of clustering.
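To make the failure mode concrete, below is a minimal, self-contained sketch (not the algorithm from the paper) of why naïvely following even a mostly accurate predictor can blow up the k-means cost, and how a simple robust per-cluster estimator avoids this. The synthetic data, the trimmed-mean estimator, the trim parameter, and all function names are illustrative assumptions, not anything specified in the abstract.

```python
# Illustrative sketch only: a few adversarially mislabeled far-away points
# drag the naive per-cluster mean, while a coordinate-wise trimmed mean
# (one simple robust estimator, chosen here for illustration) stays close
# to the true center. All names and parameters are assumptions.
import numpy as np

def naive_centers(X, pred_labels, k):
    """Follow the predictor exactly: each center is the plain mean of the
    points the predictor assigns to that cluster."""
    return np.stack([X[pred_labels == j].mean(axis=0) for j in range(k)])

def robust_centers(X, pred_labels, k, trim=0.1):
    """Coordinate-wise winsorized mean per predicted cluster: values outside
    the [trim, 1 - trim] quantile band are clipped before averaging, so a
    small fraction of mislabeled outliers cannot move the center far."""
    centers = []
    for j in range(k):
        pts = X[pred_labels == j]
        lo, hi = np.quantile(pts, [trim, 1 - trim], axis=0)
        centers.append(np.clip(pts, lo, hi).mean(axis=0))
    return np.stack(centers)

rng = np.random.default_rng(0)
k, n, d = 2, 500, 2
true_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
true_labels = rng.integers(0, k, size=n)
X = true_centers[true_labels] + rng.normal(size=(n, d))

# Predictor: correct on all n inliers, plus 5 distant junk points that it
# (adversarially) labels as cluster 0.
outliers = rng.normal(loc=1000.0, size=(5, d))
X_all = np.vstack([X, outliers])
pred = np.concatenate([true_labels, np.zeros(5, dtype=int)])

for name, centers in [("naive mean  ", naive_centers(X_all, pred, k)),
                      ("trimmed mean", robust_centers(X_all, pred, k))]:
    drift = np.linalg.norm(centers - true_centers, axis=1)
    inlier_cost = np.sum((X - centers[true_labels]) ** 2)
    print(f"{name}: center drift = {drift.round(2)}, "
          f"k-means cost on inliers = {inlier_cost:.0f}")
```

The sketch prints a center drift of roughly 28 for the naive mean on cluster 0 (the five outliers drag it) versus near zero for the trimmed mean, with a correspondingly large gap in clustering cost on the inliers. The broader intuition, that predicted labels should be aggregated robustly rather than averaged blindly, is what allows the clustering cost to degrade gracefully with the predictor's error rate.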
