Clustering Biological Data Using Enhanced k-Means Algorithm

With the advent of modern scientific methods for data collection, huge volumes of biological data are accumulating in various data banks. The sheer size of these data and the complexity of biological networks make the underlying data much harder to understand and interpret. Effective and efficient data mining techniques are essential for extracting useful information from them. A first step towards addressing this challenge is the use of clustering techniques, which help to recognize natural groupings and interesting patterns in the data set under consideration. The classical k-means clustering algorithm is widely used in practical applications, but it is computationally expensive and the accuracy of the final clusters is not always guaranteed. This paper proposes a heuristic method for improving the accuracy and efficiency of the k-means clustering algorithm. The modified algorithm is then applied to clustering biological data, with promising results.
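For orientation, the classical k-means baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard algorithm with random initial centroids, not the enhanced heuristic proposed in the paper, whose details the abstract does not give. The function names, parameters, and sample data are assumptions made for this example. The sketch shows where the two weaknesses named above come from: every iteration computes a distance from each point to each centroid, and the final clusters depend on the initial centroids.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Classical k-means with randomly chosen initial centroids.

    Returns (centroids, labels). Baseline only: this is NOT the
    enhanced method proposed in the paper.
    """
    rng = random.Random(seed)
    # Random initialisation: the final clusters depend on this choice.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)

    for _ in range(max_iter):
        # Assignment step: n * k distance computations per iteration.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster became empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]

    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
        (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
centroids, labels = kmeans(data, k=2)
print(sorted(centroids))  # [[1.0, 1.0], [11.0, 11.0]]
```

On this toy data every choice of initial centroids reaches the same two groups; which group receives label 0 depends on the seed. On larger or less cleanly separated data, different seeds can yield different final clusters.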
