K2-means for Fast and Accurate Large Scale Clustering

We propose k^2-means, a new clustering method which efficiently copes with large numbers of clusters and achieves low energy solutions. k^2-means builds upon the standard k-means (Lloyd's algorithm) and combines a new strategy to accelerate the convergence with a new low time complexity divisive initialization. The accelerated convergence is achieved through only looking at k_n nearest clusters and using triangle inequality bounds in the assignment step while the divisive initialization employs an optimal 2-clustering along a direction. The worst-case time complexity per iteration of our k^2-means is O(nk_nd+k^2d), where d is the dimension of the n data points and k is the number of clusters and usually n << k << k_n. Compared to k-means' O(nkd) complexity, our k^2-means complexity is significantly lower, at the expense of slightly increasing the memory complexity by O(nk_n+k^2). In our extensive experiments k^2-means is order(s) of magnitude faster than standard methods in computing accurate clusterings on several standard datasets and settings with hundreds of clusters and high dimensional data. Moreover, the proposed divisive initialization generally leads to clustering energies comparable to those achieved with the standard k-means++ initialization, while being significantly faster.

[1]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[2]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[3]  Yue Zhao,et al.  Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup , 2015, ICML.

[4]  David J. Kriegman,et al.  From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[6]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[7]  Daniel Boley,et al.  Principal Direction Divisive Partitioning , 1998, Data Mining and Knowledge Discovery.

[8]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[9]  Jing Wang,et al.  Fast approximate k-means via cluster closures , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Charles Elkan,et al.  Using the Triangle Inequality to Accelerate k-Means , 2003, ICML.

[12]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[13]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[14]  Philip S. Yu,et al.  Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.

[15]  D. Sculley,et al.  Web-scale k-means clustering , 2010, WWW '10.

[16]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .

[17]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[18]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[19]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[20]  Carlo Zaniolo,et al.  A fast and accurate algorithm for unsupervised clustering around centroids , 2017, Inf. Sci..

[21]  Ting Su,et al.  A deterministic method for initializing K-means clustering , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[22]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[23]  Qing He,et al.  Parallel K-Means Clustering Based on MapReduce , 2009, CloudCom.

[24]  David J. Kriegman,et al.  Acquiring linear subspaces for face recognition under variable lighting , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Keqiu Li,et al.  Efficient $k$ -Means++ Approximation with MapReduce , 2014, IEEE Trans. Parallel Distributed Syst..

[26]  Jonathan J. Hull,et al.  A Database for Handwritten Text Recognition Research , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[28]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[29]  Jonathan Drake,et al.  Accelerated k-means with adaptive distance bounds , 2012 .

[30]  Greg Hamerly,et al.  Making k-means Even Faster , 2010, SDM.