Paper information - Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup

This paper presents Yinyang K-means, a new algorithm for K-means clustering. By clustering the centers in the initial stage and leveraging efficiently maintained lower and upper bounds on the distances between a point and the centers, it avoids unnecessary distance calculations more effectively than prior algorithms. It consistently and significantly outperforms prior K-means algorithms across all tested data sets, cluster numbers, and machine configurations. This consistent, superior performance, together with its simplicity, user control of overheads, and guarantee of producing the same clustering results as standard K-means, makes Yinyang K-means a drop-in replacement for the classic K-means with an order of magnitude higher performance.
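The bound-based filtering mentioned in the abstract can be illustrated with a small sketch. This is not the paper's full algorithm: it treats all centers as a single group (so the initial clustering of centers is skipped) and keeps only one upper bound and one lower bound per point. The function names, the synthetic data, and the iteration count are assumptions made for illustration. The sketch compares the result with plain Lloyd iterations to show that skipping distance calculations need not change the assignments.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_centers(points, assign, centers):
    # Mean of each cluster; an empty cluster keeps its previous center.
    new = []
    for j, old in enumerate(centers):
        members = [p for p, a in zip(points, assign) if a == j]
        if members:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new.append(old)
    return new

def lloyd(points, centers, iters):
    # Standard K-means: every point is compared with every center each round.
    centers = list(centers)
    assign = []
    for _ in range(iters):
        assign = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        centers = update_centers(points, assign, centers)
    return assign, centers

def filtered_kmeans(points, centers, iters):
    # Simplified single-group filter (an assumption, not the paper's method).
    # ub[i] >= distance to the assigned center,
    # lb[i] <= distance to every other center.
    centers, k, count = list(centers), len(centers), 0
    assign, ub, lb = [], [], []
    for p in points:
        d = [dist(p, c) for c in centers]
        count += k
        b = min(range(k), key=d.__getitem__)
        assign.append(b)
        ub.append(d[b])
        lb.append(min(d[j] for j in range(k) if j != b))
    for step in range(iters):
        new_centers = update_centers(points, assign, centers)
        drift = [dist(c, n) for c, n in zip(centers, new_centers)]
        centers = new_centers
        if step == iters - 1:
            break
        max_drift = max(drift)
        for i, p in enumerate(points):
            # Triangle inequality: a center that moved by `drift` changes
            # any point-to-center distance by at most `drift`.
            ub[i] += drift[assign[i]]
            lb[i] -= max_drift
            if lb[i] > ub[i]:
                continue  # assignment cannot change; no distance computed
            ub[i] = dist(p, centers[assign[i]])  # tighten the upper bound
            count += 1
            if lb[i] > ub[i]:
                continue
            d = [ub[i] if j == assign[i] else dist(p, c)
                 for j, c in enumerate(centers)]
            count += k - 1
            b = min(range(k), key=d.__getitem__)
            assign[i], ub[i] = b, d[b]
            lb[i] = min(d[j] for j in range(k) if j != b)
    return assign, centers, count

# Synthetic, well-separated blobs (illustrative data only).
random.seed(0)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
          for cx, cy in blobs for _ in range(50)]
init = [points[0], points[50], points[100], points[150]]
iters = 10

ref_assign, _ = lloyd(points, init, iters)
fast_assign, _, fast_count = filtered_kmeans(points, init, iters)
ref_count = len(points) * len(init) * iters  # point-to-center distances in Lloyd
print(fast_assign == ref_assign, fast_count, ref_count)
```

The counter tracks only point-to-center distances; the few center-to-center distances needed for the drift values are not included. Once the centers stop moving, every drift is zero and the filter skips all points, which is where most of the savings in this toy setting come from.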
