More Work on K-Means Clustering Algorithm: The Dimensionality Problem

The K-means clustering algorithm is an old algorithm that has been intensely researched owing to its simplicity of implementation. However, its performance has also been criticized, in particular for requiring the value of K a priori. Previous research indicates that supplying the number of clusters a priori does not in itself help to produce good-quality clusters. The objective of this paper is to investigate the usefulness of K-means for clustering high- and multi-dimensional data by applying it to biological sequence data, which are known to be high- and multi-dimensional. The squared Euclidean distance and the cosine measure are used as the similarity measures. The silhouette validity index is used first to show the K-means algorithm's inefficiency in clustering high- and multi-dimensional data, irrespective of the distance or similarity measure employed. A further study introduces a preprocessor scheme for the K-means algorithm that automatically initializes a suitable value of K prior to execution of the algorithm. The investigation of the dimensionality problem suggests that the use of the preprocessor significantly improves the quality of the clusters for the biological data sets considered. Furthermore, it is shown that the K-means algorithm with the preprocessor produces good-quality, compact, and well-separated clusters of the biological data obtained from a high-dimension-to-low-dimension mapping scheme introduced in the paper.

General Terms: K-means, Clustering, Algorithm.
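As a rough illustration of the validation step described above, the following minimal sketch computes silhouette widths for a given cluster assignment under either squared Euclidean distance or cosine distance (taken here as one minus the cosine similarity). It is not the paper's implementation: the function names (`squared_euclidean`, `cosine_distance`, `silhouette_widths`, `mean_silhouette`), the zero-vector convention, and the toy data are assumptions made for illustration only.

```python
import math


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention for zero vectors (not from the paper)
    return 1.0 - dot / (norm_a * norm_b)


def silhouette_widths(points, labels, dist):
    """Per-point silhouette widths s(i) = (b - a) / max(a, b)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # singleton clusters get width 0 by convention
            continue
        # a: mean dissimilarity to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0.0 else (b - a) / denom)
    return widths


def mean_silhouette(points, labels, dist):
    """Average silhouette width over all points (higher is better)."""
    widths = silhouette_widths(points, labels, dist)
    return sum(widths) / len(widths)


# Toy data (hypothetical): two visibly separated groups of 2-D points.
toy_points = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
good_labels = [0, 0, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1]    # mixes the two groups

for name, dist in (("squared Euclidean", squared_euclidean),
                   ("cosine", cosine_distance)):
    print(name,
          mean_silhouette(toy_points, good_labels, dist),
          mean_silhouette(toy_points, bad_labels, dist))
```

A mean width close to 1 suggests compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned points. Comparing the mean width across candidate values of K is one common way to judge a clustering, although the preprocessor scheme mentioned in the abstract may choose K differently.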
