Enhancing the K-means Clustering Algorithm by Using an O(n log n) Heuristic Method for Finding Better Initial Centroids

With the advent of modern techniques for scientific data collection, large quantities of data are accumulating in various databases. Systematic data analysis methods are necessary to extract useful information from these rapidly growing data banks. Cluster analysis is one of the major data mining methods, and the k-means clustering algorithm is widely used in many practical applications. However, the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends substantially on the choice of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means algorithm. This paper proposes an improvement to the classic k-means algorithm that produces more accurate clusters. The proposed algorithm comprises an O(n log n) heuristic method, based on sorting and partitioning the input data, for finding initial centroids in accordance with the data distribution. Experimental results show that the proposed algorithm produces better clusters in less computation time.
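
The abstract does not spell out the heuristic beyond "sorting and partitioning the input data", so the following is only a minimal sketch of one way such a seeding step could look, not the paper's actual procedure. It assumes that points are sorted by their Euclidean distance from the origin, split into k contiguous chunks of nearly equal size, and that each chunk's mean becomes an initial centroid. The function name `initial_centroids` and the choice of sort key are hypothetical. The sort dominates the cost, which gives the O(n log n) bound for a fixed dimension.

```python
from math import sqrt
from typing import List, Sequence

Point = Sequence[float]


def initial_centroids(points: List[Point], k: int) -> List[List[float]]:
    """Sort points by a scalar key, split them into k contiguous chunks,
    and return the mean of each chunk as an initial centroid."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    # Assumption: the sort key is the Euclidean norm of each point.
    # The abstract only says the data is sorted, not by what.
    ordered = sorted(points, key=lambda p: sqrt(sum(x * x for x in p)))

    dim = len(ordered[0])
    centroids = []
    for i in range(k):
        # Integer arithmetic keeps chunk sizes within one of each other
        # and guarantees every chunk is non-empty when k <= n.
        start = i * n // k
        end = (i + 1) * n // k
        chunk = ordered[start:end]
        centroids.append(
            [sum(p[d] for p in chunk) / len(chunk) for d in range(dim)]
        )
    return centroids


# Two well-separated groups of three points each.
sample = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
seeds = initial_centroids(sample, k=2)
```

The resulting seeds would then replace the random initial centroids in the usual k-means iteration; because they follow the sorted order of the data, each seed starts inside a populated region rather than at an arbitrary location.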
