Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques

K-means clustering is arguably the most popular technique for partitioning data. Unfortunately, K-means suffers from the well-known problem of locally optimal solutions: the algorithm can converge to a partition that does not globally minimize the within-cluster sum of squares. Furthermore, the final partition depends on the initial configuration, which makes the choice of starting partition especially important. This paper evaluates 12 initialization procedures proposed in the literature and provides recommendations for best practices.
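The dependence on the starting configuration can be illustrated with a minimal sketch. It is not one of the 12 procedures evaluated in the paper; it assumes the simplest possible initialization (k observations drawn at random as seeds) and a plain batch K-means loop, and all function names (`kmeans`, `best_of_restarts`, `make_toy_data`) are hypothetical. Running the same algorithm from several seeds and recording the within-cluster sum of squares (SSE) for each run shows whether different starts reach different solutions.

```python
import numpy as np


def kmeans(X, k, seed, max_iter=100):
    """Batch K-means from one random start; returns (labels, centers, sse)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute the assignment and SSE against the final centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = float(d2[np.arange(len(X)), labels].sum())
    return labels, centers, sse


def best_of_restarts(X, k, n_starts=20):
    """Run K-means from n_starts seeds; return the lowest-SSE run and all SSEs."""
    runs = [kmeans(X, k, seed) for seed in range(n_starts)]
    sses = [run[2] for run in runs]
    return runs[int(np.argmin(sses))], sses


def make_toy_data(seed=0):
    """Four synthetic Gaussian clusters in the plane (illustrative data only)."""
    rng = np.random.default_rng(seed)
    means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
    return np.vstack([m + rng.normal(scale=0.5, size=(50, 2)) for m in means])


if __name__ == "__main__":
    X = make_toy_data()
    (labels, centers, sse), sses = best_of_restarts(X, k=4)
    print("distinct SSE values across starts:", sorted(set(np.round(sses, 3))))
    print("best SSE:", round(sse, 3))
```

If the printed list contains more than one value, some starts converged to locally optimal partitions that are worse than the best one found. Keeping the lowest-SSE run over many starts is only a baseline safeguard; it does not guarantee a global optimum.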
