Speeding up k-means by approximating Euclidean distances via block vectors

This paper introduces a new method to approximate Euclidean distances between points using block vectors in combination with Hölder's inequality. By defining lower bounds based on the proposed approximation, clustering algorithms can be considerably accelerated without loss of quality. In extensive experiments, we show a considerable reduction in computational time in comparison to standard methods and the recently proposed Yinyang k-means. Additionally, we show that the memory consumption of the presented clustering algorithm does not depend on the number of clusters, which makes the approach suitable for large-scale problems.
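
To illustrate the kind of bound the abstract refers to, the following is a minimal sketch and not the paper's implementation. It assumes the special case p = q = 2 of Hölder's inequality (Cauchy-Schwarz), contiguous blocks of roughly equal size, and dense NumPy vectors; the names `block_vector` and `lower_bound` are hypothetical. Applying the inequality per block gives <x, y> <= <x_B, y_B>, where x_B holds the norms of the blocks of x, so ||x||^2 + ||y||^2 - 2<x_B, y_B> can never exceed the squared Euclidean distance.

```python
import numpy as np

def block_vector(x, n_blocks):
    """Compress x to the vector of 2-norms of its contiguous blocks (assumed layout)."""
    blocks = np.array_split(x, n_blocks)
    return np.array([np.linalg.norm(b) for b in blocks])

def lower_bound(x, y, n_blocks):
    """Lower bound on ||x - y|| using only squared norms and block vectors."""
    xb = block_vector(x, n_blocks)
    yb = block_vector(y, n_blocks)
    # Per-block Cauchy-Schwarz: <x, y> <= <xb, yb>, hence
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y> >= ||x||^2 + ||y||^2 - 2<xb, yb>.
    sq = x @ x + y @ y - 2.0 * (xb @ yb)
    return np.sqrt(max(sq, 0.0))  # clamp tiny negative values from rounding

rng = np.random.default_rng(0)
x = rng.normal(size=64)
y = rng.normal(size=64)

exact = np.linalg.norm(x - y)
bound = lower_bound(x, y, n_blocks=8)
print(bound <= exact)
```

In a k-means assignment step, such a bound could be used to skip an exact distance computation whenever the bound to a candidate center already exceeds the distance to the currently assigned center. In practice the block vectors and squared norms would be precomputed once per point and once per center update rather than on every call as in this sketch.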
