Fast approximate spectral clustering

Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n3) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nystrom method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.

[1]  J. Tukey A survey of sampling from contaminated distributions , 1960 .

[2]  W. Hoeffding The strong law of large numbers for u-statistics. , 1961 .

[3]  G. Stewart Introduction to matrix computations , 1973 .

[4]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[5]  Ieee Xplore,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence Information for Authors , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[7]  Ing Rj Ser Approximation Theorems of Mathematical Statistics , 1980 .

[8]  R. Serfling Approximation Theorems of Mathematical Statistics , 1980 .

[9]  Paul L. Zador,et al.  Asymptotic quantization error of continuous signals and the quantization dimension , 1982, IEEE Trans. Inf. Theory.

[10]  Bruce Hendrickson,et al.  A Multi-Level Algorithm For Partitioning Graphs , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[11]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Axthonv G. Oettinger,et al.  IEEE Transactions on Information Theory , 1998 .

[13]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[14]  Paul S. Bradley,et al.  Refining Initial Points for K-Means Clustering , 1998, ICML.

[15]  David L. Neuhoff,et al.  Quantization , 2022, IEEE Trans. Inf. Theory.

[16]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[17]  Pedro Larrañaga,et al.  An empirical comparison of four initialization methods for the K-Means algorithm , 1999, Pattern Recognit. Lett..

[18]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[19]  B. Schölkopf,et al.  Sparse Greedy Matrix Approximation for Machine Learning , 2000, ICML.

[20]  Christopher K. I. Williams,et al.  Using the Nyström Method to Speed Up Kernel Machines , 2000, NIPS.

[21]  Santosh S. Vempala,et al.  On clusterings-good, bad and spectral , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[22]  Katya Scheinberg,et al.  Efficient SVM Training Using Low-Rank Kernel Representations , 2002, J. Mach. Learn. Res..

[23]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[24]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.

[25]  D.M. Mount,et al.  An Efficient k-Means Clustering Algorithm: Analysis and Implementation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  C. A. Murthy,et al.  Density-Based Multiscale Data Condensation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Jianbo Shi,et al.  Multiclass spectral clustering , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[28]  Jitendra Malik,et al.  Spectral grouping using the Nystrom method , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[30]  Amit Kumar,et al.  A simple linear time (1 + /spl epsiv/)-approximation algorithm for k-means clustering in any dimensions , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[31]  Christian Posse,et al.  Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction , 2002, Data Mining and Knowledge Discovery.

[32]  Amit Kumar,et al.  A simple linear time ( 1+ ε)- approximation algorithm for geometric k-means clustering in any dimensions , 2004 .

[33]  Petros Drineas,et al.  On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning , 2005, J. Mach. Learn. Res..

[34]  Michael I. Jordan,et al.  Learning Spectral Clustering, With Application To Speech Separation , 2006, J. Mach. Learn. Res..

[35]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[37]  Stephen J. Redmond,et al.  A method for initialising the K-means clustering algorithm using kd-trees , 2007, Pattern Recognit. Lett..

[38]  S. V. N. Vishwanathan,et al.  Fast Iterative Kernel Principal Component Analysis , 2007, J. Mach. Learn. Res..

[39]  Sanjoy Dasgupta,et al.  Random projection trees and low dimensional manifolds , 2008, STOC.

[40]  Ling Huang,et al.  Spectral Clustering with Perturbed Data , 2008, NIPS.

[41]  Mikhail Belkin,et al.  Consistency of spectral clustering , 2008, 0804.0678.

[42]  Jing-Yu Yang,et al.  Hierarchical initialization approach for K-Means clustering , 2008, Pattern Recognit. Lett..

[43]  Alexander Kukush,et al.  Measurement Error Models , 2011, International Encyclopedia of Statistical Science.

[44]  Alexander Kukush,et al.  Measurement Error Models , 2011, International Encyclopedia of Statistical Science.