Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup

This paper presents Yinyang K-means, a new algorithm for K-means clustering. By clustering the centers in the initial stage and leveraging efficiently maintained lower and upper bounds on the distances between a point and the centers, it avoids unnecessary distance calculations more effectively than prior algorithms. It consistently and significantly outperforms prior K-means algorithms across all tested data sets, cluster numbers, and machine configurations. This consistent, superior performance, together with its simplicity, user control of overheads, and guarantee of producing the same clustering results as the standard K-means, makes Yinyang K-means a drop-in replacement for the classic K-means with an order of magnitude higher performance.
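
The abstract does not spell out the procedure, so the following is only a minimal sketch of the bound-based filtering idea, under stated assumptions. Each point keeps one upper bound on the distance to its assigned center and one lower bound per group of centers; after the centers move, the bounds are loosened by the center drift (triangle inequality), and distances are recomputed only for groups whose lower bound falls below the upper bound. All names (`yinyang_sketch`, `lloyd`, `update_centers`, the group count `t`) are hypothetical. The centers are grouped round-robin here as a simplification, whereas the paper forms the groups by clustering the centers, and any finer per-center filtering is omitted.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_centers(points, labels, centers):
    """Mean of each cluster; an empty cluster keeps its old center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
            for j in range(len(centers))]

def lloyd(points, centers, iters):
    """Reference: standard K-means, k distances per point per assignment step."""
    k = len(centers)
    nearest = lambda p, cs: min(range(k), key=lambda j: dist(p, cs[j]))
    labels = [nearest(p, centers) for p in points]
    for _ in range(iters):
        centers = update_centers(points, labels, centers)
        labels = [nearest(p, centers) for p in points]
    return labels

def yinyang_sketch(points, centers, iters, t):
    """Bound-filtered K-means; returns (labels, point-center distances computed).
    Assumes 1 <= t <= k so that every group is non-empty."""
    k = len(centers)
    # Assumption: round-robin groups instead of clustering the centers.
    groups = [[j for j in range(k) if j % t == g] for g in range(t)]
    computed, labels, ub, lb = 0, [], [], []
    for p in points:  # initial full assignment, also initializes the bounds
        d = [dist(p, c) for c in centers]
        computed += k
        b = min(range(k), key=d.__getitem__)
        labels.append(b)
        ub.append(d[b])  # exact distance to the assigned center
        # per group: distance to the closest center other than the assigned one
        lb.append([min((d[j] for j in g if j != b), default=math.inf) for g in groups])
    for _ in range(iters):
        new_centers = update_centers(points, labels, centers)
        drift = [dist(a, c) for a, c in zip(centers, new_centers)]
        gdrift = [max(drift[j] for j in g) for g in groups]
        centers = new_centers
        for i, p in enumerate(points):
            b = labels[i]
            ub[i] += drift[b]                                  # still an upper bound
            lb[i] = [l - gd for l, gd in zip(lb[i], gdrift)]   # still lower bounds
            if ub[i] <= min(lb[i]):
                continue  # global filter: no other center can be closer
            ub[i] = dist(p, centers[b])  # tighten the upper bound and retry
            computed += 1
            if ub[i] <= min(lb[i]):
                continue
            examined = [l < ub[i] for l in lb[i]]  # group filter
            known = {b: ub[i]}
            for g, members in enumerate(groups):
                if examined[g]:
                    for j in members:
                        if j != b:
                            known[j] = dist(p, centers[j])
                            computed += 1
            best = min(known, key=lambda j: (known[j], j))
            for g, members in enumerate(groups):
                if examined[g]:
                    lb[i][g] = min((known[j] for j in members if j != best),
                                   default=math.inf)
                elif b in members and best != b:
                    lb[i][g] = known[b]  # old center is now the closest "other"
            labels[i], ub[i] = best, known[best]
    return labels, computed
```

Called with the same points, initial centers, and iteration count, `yinyang_sketch` is intended to return the same labels as `lloyd` (barring exact distance ties) while never computing more point-to-center distances. The group count `t` is the knob that trades bound storage against filtering precision, which is one plausible reading of the "user control of overheads" mentioned above.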
