The K-Means Algorithm Evolution

Clustering is one of the main methods for gaining insight into the underlying nature and structure of data. Its purpose is to organize a set of data into clusters such that the elements in each cluster are similar to one another and different from those in other clusters. K-means is currently one of the most widely used clustering algorithms because its results are easy to interpret and it is easy to implement. The K-means clustering problem is NP-hard, which justifies the use of heuristic methods to solve it. To date, a large number of improvements to the algorithm have been proposed, and the most relevant of these were selected using a systematic review methodology. The search retrieved 1125 documents on improvements, of which 79 remained after applying inclusion and exclusion criteria. The selected improvements were classified and summarized according to the steps of the algorithm: initialization, classification, centroid calculation, and convergence. Notably, some of the most successful variants of the algorithm were identified. Some articles on recent trends were also included, covering both improvements to K-means and its use in other areas. Finally, the main improvements may inspire the development of new heuristics for K-means or for other clustering algorithms.
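To make the four steps named above concrete, the following is a minimal sketch of the standard (Lloyd-style) K-means loop in plain Python. It is an illustration only, not one of the improved variants surveyed here. The function name `kmeans`, the parameters `max_iter` and `seed`, random selection of data points as initial centroids, and the "no assignment changed" stopping rule are all assumptions chosen for simplicity.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative standard K-means on a list of equal-length numeric tuples.

    Hypothetical helper for exposition; names and defaults are assumptions.
    """
    rng = random.Random(seed)

    # Step 1, initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None

    for _ in range(max_iter):
        # Step 2, classification: assign each point to its nearest centroid
        # (squared Euclidean distance; ties go to the lowest centroid index).
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]

        # Step 4, convergence: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 3, centroid calculation: move each centroid to the mean of
        # its members; an empty cluster keeps its previous centroid.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    final_centroids, final_labels = kmeans(data, k=2)
    print(final_centroids, final_labels)
```

Each numbered comment marks a step that an improvement could replace: a different seeding rule in step 1, a pruned distance search in step 2, or an earlier stopping condition in step 4.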
