The k-means Algorithm: A Comprehensive Survey and Performance Evaluation

The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. Despite its popularity, the algorithm has several limitations. Random initialization of the centroids can lead to unexpected convergence. The number of clusters must also be defined beforehand, a choice that affects the shapes of the resulting clusters and their sensitivity to outliers. A further fundamental problem is the algorithm's inability to handle a variety of data types. This paper provides a structured and synoptic overview of the research conducted on the k-means algorithm to overcome these shortcomings. Variants of the k-means algorithm, including recent developments, are discussed, and their effectiveness is investigated through experimental analysis on a variety of datasets. The detailed experimental analysis, together with a thorough comparison among different k-means clustering algorithms, differentiates this work from existing survey papers. Furthermore, the paper offers a clear and thorough understanding of the k-means algorithm along with its different research directions.
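The sensitivity to centroid initialization noted above can be illustrated with a minimal sketch. It is not taken from the paper: the four-point toy dataset, the function names (`sq_dist`, `mean`, `kmeans`), and the two hand-picked starting configurations are assumptions made purely for illustration. The code runs the standard assignment/update iteration (Lloyd's method) from two different sets of initial centroids and compares the resulting sum of squared errors (SSE).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))


def kmeans(points, centroids, max_iter=100):
    """Run Lloyd's iteration from the given initial centroids.

    Returns the final centroids and the sum of squared errors (SSE).
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = [mean(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # no centroid moved: converged
            break
        centroids = new
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


# Hypothetical toy data: the corners of a 4 x 1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one centroid near each short side (left / right split).
_, sse_a = kmeans(points, [(0, 0.5), (4, 0.5)])
# Start B: centroids on the long sides (bottom / top split).
_, sse_b = kmeans(points, [(2, 0), (2, 1)])

print(sse_a, sse_b)  # 1.0 16.0
```

Both runs converge immediately, but start B is trapped in a partition whose SSE is sixteen times larger than that of start A. The iteration only guarantees a local optimum, so the quality of the result depends on where the centroids begin.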