A Modified k-means Algorithm to Avoid Empty Clusters

The k-means algorithm is one of the most widely used clustering algorithms and has been applied in many fields of science and technology. A major problem of the algorithm is that it may produce empty clusters, depending on the initial center vectors. For static execution of k-means, this problem is considered insignificant and can be solved by running the algorithm several times. In situations where k-means is used as an integral part of some higher-level application, however, the empty cluster problem may produce anomalous system behavior and may lead to significant performance degradation. This paper presents a modified version of the k-means algorithm that efficiently eliminates the empty cluster problem. We show that the proposed algorithm is semantically equivalent to the original k-means and that the incorporated modification causes no performance degradation. Results of simulation experiments on several data sets support this claim.
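
To make the empty cluster problem concrete, the following minimal sketch runs one iteration of standard k-means on a small one-dimensional data set whose third initial center lies far from every point. It illustrates the problem only and is not the modified algorithm proposed in the paper. The function names `assign` and `update`, the sample data, and the choice to keep the old center when a cluster is empty are all assumptions made for this illustration.

```python
def assign(points, centers):
    """Return, for each point, the index of its nearest center
    (squared Euclidean distance; ties go to the lowest index)."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Recompute each center as the mean of its members.

    Returns the new centers and the indices of clusters that received
    no points. For an empty cluster the mean is undefined, so this
    sketch simply keeps the old center (an assumption, not the paper's fix).
    """
    new_centers, empty = [], []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            empty.append(k)
            new_centers.append(old)
        else:
            n = len(members)
            new_centers.append(
                tuple(sum(m[d] for m in members) / n for d in range(len(old)))
            )
    return new_centers, empty


# Hypothetical data: two natural groups, but three initial centers.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
centers = [(1.0,), (10.5,), (100.0,)]

labels = assign(points, centers)
new_centers, empty = update(points, labels, centers)
print("labels:", labels)
print("empty clusters:", empty)
```

Because no point is closer to the third center than to the other two, that cluster stays empty after the assignment step, and a plain mean update has nothing to average. A higher-level application that expects exactly k non-empty clusters would have to detect and handle this case.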
