Paper Information - An Experiment of K-Means Initialization Strategies on Handwritten Digits Dataset


Clustering is an important unsupervised classification method that divides data into groups based on a similarity metric. K-means has become an increasingly popular clustering method and is widely used in different applications. Centroid initialization is a key step in K-means clustering. In general, K-means has three efficient initialization strategies to improve its performance: Random, K-means++, and PCA-based K-means. In this paper, we design an experiment to evaluate these three strategies on the UCI ML handwritten digits dataset. The experimental results show that the three initialization strategies find almost identical cluster centroids and produce almost the same clustering results, but the PCA-based K-means strategy significantly improves running time and is faster than the other two strategies.
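To make the comparison of initialization strategies concrete, the following is a minimal sketch of two of them, Random and K-means++, each followed by standard Lloyd iterations. It is not the paper's code: it runs on small synthetic 2-D blobs rather than the handwritten digits dataset, it omits the PCA-based strategy for brevity, and every function name (`init_random`, `init_kmeans_pp`, `lloyd`, `make_blobs`) is hypothetical and introduced only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_random(points, k, rng):
    """Random strategy: pick k distinct data points uniformly at random."""
    return rng.sample(points, k)


def init_kmeans_pp(points, k, rng):
    """K-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def lloyd(points, centroids, max_iter=100):
    """Run Lloyd's iterations from the given initial centroids.
    Returns (final centroids, inertia, number of iterations)."""
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia, iteration


def make_blobs(rng, centers, n_per_center=50, spread=0.5):
    """Synthetic stand-in data (an assumption, not the digits dataset)."""
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(n_per_center)]


rng = random.Random(0)
points = make_blobs(rng, centers=[(0, 0), (5, 5), (0, 5), (5, 0)])

results = {}
for name, init in [("random", init_random), ("k-means++", init_kmeans_pp)]:
    _, inertia, n_iter = lloyd(points, init(points, 4, rng))
    results[name] = (inertia, n_iter)
    print(f"{name:10s} inertia={inertia:.2f} iterations={n_iter}")
```

Comparing the final inertia and the iteration count per strategy mirrors, on a toy scale, the kind of measurement the abstract describes: whether the strategies reach similar centroids and how much work each needs to get there.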

Boyang Li
