Single Pass Seed Selection Algorithm for k-Means

Problem statement: The k-means method is one of the most widely used clustering techniques across applications. However, k-means often converges to a local optimum, and its result depends on the initial seeds; an inappropriate choice of initial seeds may yield poor results. k-means++ initializes k-means by choosing initial seeds with specific probabilities. Because of the random selection of the first seed and of the minimum probable distance, k-means++ also produces different clusters, in different numbers of iterations, from run to run. Approach: In this study we propose the Single Pass Seed Selection (SPSS) algorithm, a modification of k-means++ that initializes the first seed and the probable distance for k-means++ based on the point that is close to the largest number of other points in the data set. Results: We evaluated its performance on various data sets and compared it with k-means++. SPSS is a single-pass algorithm that yields a unique solution in fewer iterations than k-means++. Experimental results on real data sets from UCI (4-60 dimensions, 27-10945 objects and 2-10 clusters) demonstrate the effectiveness of SPSS in producing consistent clustering results. Conclusion: SPSS performed well on high-dimensional data sets. Its efficiency increased with the number of features in the data set; we recommend the proposed method particularly when the number of features is greater than 10.
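The deterministic seeding idea can be illustrated with a minimal sketch. It rests on two assumptions that the abstract does not confirm. First, "the point close to the largest number of other points" is read as the point with the smallest total Euclidean distance to all others. Second, because the abstract does not spell out how the probable distance is fixed, the remaining seeds are chosen with a plain farthest-point (maximin) rule; this is a stand-in, not the SPSS rule. The function name `deterministic_seeds` is hypothetical.

```python
import math


def deterministic_seeds(points, k):
    """Pick k seeds from `points` (equal-length tuples) with no randomness.

    Illustrative only: the first-seed rule is an assumed reading of the
    abstract, and the farthest-point rule for the remaining seeds is a
    substitute for the SPSS threshold, which the abstract does not define.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Pairwise Euclidean distances: O(n^2) time and memory.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # First seed (assumption): the point with the smallest total distance
    # to every other point. Ties go to the lowest index.
    first = min(range(n), key=lambda i: sum(dist[i]))
    seeds = [first]

    # nearest[i] = distance from point i to its closest chosen seed.
    nearest = list(dist[first])

    while len(seeds) < k:
        # Stand-in rule: take the point farthest from all chosen seeds.
        nxt = max(range(n), key=lambda i: nearest[i])
        seeds.append(nxt)
        nearest = [min(a, b) for a, b in zip(nearest, dist[nxt])]

    return [points[i] for i in seeds]


pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(deterministic_seeds(pts, 2))  # [(2.0, 0.0), (10.0, 0.0)]
```

Since nothing is drawn at random, repeated calls on the same data return the same seeds, which is the property the abstract contrasts with k-means++. The seeds would then be handed to an ordinary k-means loop as its initial centers.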
