Performance Analysis of K-Means Seeding Algorithms

K-Means is one of the most widely used clustering algorithms. However, because its optimization process is a greedy, iterative descent, K-Means is sensitive to the initial set of centers. It has been shown that a poor initial set of centroids can degrade cluster quality. Therefore, numerous initialization methods have been developed to prevent poor K-Means performance. Nonetheless, these initialization methods are usually validated using only the Sum of Squared Errors (SSE) as the quality measure. In this study, we evaluate three state-of-the-art initialization methods with three different quality measures, i.e., SSE, the Silhouette Coefficient, and the Adjusted Rand Index. The analysis is carried out on seventeen benchmarks. We provide new insight into aspects of initialization-method performance that are traditionally overlooked; our results describe the high correlation between the different initialization methods and fitness functions. These results may help optimize K-Means, with low effort, for topological structures beyond those covered by optimizing SSE.
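The evaluation protocol described above can be sketched in a few lines with scikit-learn: run K-Means under different seeding strategies and score each result with SSE (K-Means' own objective, `inertia_`), the Silhouette Coefficient, and the Adjusted Rand Index against ground-truth labels. This is a minimal illustration on a synthetic dataset, not the study's actual benchmark suite; the dataset parameters and the choice of `"random"` vs. `"k-means++"` seeding are assumptions for demonstration.

```python
# Sketch: compare two K-Means seeding strategies under three quality measures.
# Synthetic data stands in for the seventeen benchmarks used in the study.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)

results = {}
for init in ("random", "k-means++"):
    km = KMeans(n_clusters=4, init=init, n_init=10, random_state=0).fit(X)
    results[init] = {
        "SSE": km.inertia_,                              # Sum of Squared Errors
        "Silhouette": silhouette_score(X, km.labels_),   # internal validity
        "ARI": adjusted_rand_score(y_true, km.labels_),  # external validity
    }

for init, scores in results.items():
    print(init, {k: round(v, 3) for k, v in scores.items()})
```

Note that SSE is minimized while the Silhouette Coefficient and ARI are maximized, so the three measures need not agree on which seeding is best; that disagreement is precisely what the study examines.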
