Improving the Accuracy and Efficiency of the k-means Clustering Algorithm

The emergence of modern techniques for scientific data collection has resulted in the large-scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate for extracting useful information from such huge data banks. Cluster analysis is one of the major data analysis methods, and the k-means clustering algorithm is widely used in many practical applications. However, the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to obtain better clustering with reduced complexity.
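To make the dependence on initial centroids concrete, the following is a minimal sketch of the standard (Lloyd-style) k-means iteration. It is not the method proposed in this paper. The function name `kmeans`, its parameters, and the four-point toy data set are assumptions chosen purely for illustration. The same data are clustered twice with different starting centroids, and the sum of squared errors (SSE) is compared.

```python
import math


def kmeans(points, init_centroids, max_iter=100):
    """Standard k-means on tuples of numbers (illustrative sketch only).

    Returns (centroids, labels, sse), where sse is the sum of squared
    distances from each point to its assigned centroid.
    """
    centroids = [tuple(c) for c in init_centroids]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: n * k distance computations per iteration.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical toy data: two tight pairs, far apart along the x-axis.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start with one centroid in each pair.
_, good_labels, good_sse = kmeans(points, [(0, 0), (4, 0)])

# Start with both centroids midway between the pairs.
_, bad_labels, bad_sse = kmeans(points, [(2, 0), (2, 1)])

print(good_labels, good_sse)
print(bad_labels, bad_sse)
```

With the second initialization the algorithm converges immediately to a partition that splits the points by their y-coordinate, which has a much larger SSE than the left/right partition found from the first initialization. Each iteration also compares every point against every centroid, which is the main source of the cost the abstract refers to.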
