Paper information - Single Pass Seed Selection Algorithm for k-Means

Single Pass Seed Selection Algorithm for k-Means

Problem statement: The k-means method is one of the most widely used clustering techniques across a range of applications. However, k-means often converges to a local optimum, and its result depends on the initial seeds. An inappropriate choice of initial seeds may yield poor results. k-means++ is a way of initializing k-means by choosing initial seeds with specific probabilities. Because the first seed is selected at random and the minimum probable distance varies, k-means++ also produces different clusters in different runs, with different numbers of iterations. Approach: In this study we propose the Single Pass Seed Selection (SPSS) algorithm, a modification of k-means++ that initializes the first seed and the probable distance for k-means++ based on the point that is close to the largest number of other points in the data set. Results: We evaluated its performance on various data sets and compared it with k-means++. SPSS is a single-pass algorithm that yields a unique solution in fewer iterations than k-means++. Experimental results on real data sets from UCI (4-60 dimensions, 27-10945 objects, and 2-10 clusters) demonstrated the effectiveness of SPSS in producing consistent clustering results. Conclusion: SPSS performed well on high-dimensional data sets. Its efficiency increased with the number of features in the data set; we suggest the proposed method particularly when the number of features is greater than 10.
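The seeding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact criterion for "close to the largest number of other points", so the sketch assumes the first seed is the point with the smallest total Euclidean distance to all other points. The remaining seeds use the standard k-means++ rule (sampling proportional to squared distance from the nearest chosen seed), not the SPSS rule for the probable distance, which the abstract does not specify. The function names `first_seed_by_closeness` and `seed_centers` are hypothetical.

```python
import math
import random


def first_seed_by_closeness(points):
    """Return the index of the point with the smallest total distance to all others.

    Assumption: this is one possible reading of "the point close to the largest
    number of other points"; the abstract does not state the exact criterion.
    """
    totals = [sum(math.dist(p, q) for q in points) for p in points]
    return min(range(len(points)), key=totals.__getitem__)


def seed_centers(points, k, rng=None):
    """Pick k seed indices: a deterministic first seed, then k-means++ sampling.

    Assumes the data contain at least k distinct points.
    """
    rng = rng or random.Random(0)
    chosen = [first_seed_by_closeness(points)]
    # Squared distance from each point to its nearest chosen seed.
    d2 = [math.dist(p, points[chosen[0]]) ** 2 for p in points]
    while len(chosen) < k:
        # Standard k-means++ step: sample an index with probability
        # proportional to its squared distance from the nearest seed.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        chosen.append(idx)
        d2 = [min(old, math.dist(p, points[idx]) ** 2)
              for old, p in zip(d2, points)]
    return chosen


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (20.0, 0.0)]
    print(seed_centers(data, k=3))
```

Fixing the first seed removes one source of run-to-run variation in k-means++. In this sketch the later seeds are still sampled, so results repeat only when the random generator is seeded identically.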
