Effects of Group Size and Lack of Sphericity on the Recovery of Clusters in K-means Cluster Analysis

K-means cluster analysis is known to favor spherical clusters of roughly equal size. To assess the magnitude of this tendency, a simulation study was conducted in which populations were created with varying departures from sphericity and varying group sizes. An analysis of cluster recovery in samples drawn from these populations showed significant effects of both lack of sphericity and group size. These effects were, however, smaller than expected: even in the "worst case scenario" the recovery index remained above 0.5. The two factors also interacted: the decline in cluster recovery with increasing departure from sphericity differed between equal and unequal group sizes.
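
The kind of simulation described above can be illustrated with a minimal sketch. It is not the study's actual design: the abstract does not state the number of clusters, the dimensionality, how sphericity was manipulated, or which recovery index was used. The sketch assumes two bivariate normal clusters, models departure from sphericity by stretching one axis (the hypothetical `stretch` parameter), and uses the adjusted Rand index as the recovery measure. All function names are illustrative.

```python
import math
import random
from collections import Counter


def make_sample(sizes, stretch, rng, separation=4.0):
    """Draw one bivariate normal cluster per entry in `sizes`.

    Assumption: `stretch` is the s.d. along the y-axis (1.0 = spherical);
    cluster means are spaced `separation` apart along the x-axis.
    """
    points, labels = [], []
    for k, n in enumerate(sizes):
        for _ in range(n):
            points.append((rng.gauss(k * separation, 1.0), rng.gauss(0.0, stretch)))
            labels.append(k)
    return points, labels


def sq_dist(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def kmeans(points, k, rng, n_iter=50, n_starts=10):
    """Lloyd's algorithm with random restarts; keeps the lowest within-cluster SS."""
    best_labels, best_ss = None, math.inf
    for _ in range(n_starts):
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            labels = assign(points, centers)
            new_centers = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    new_centers.append((sum(m[0] for m in members) / len(members),
                                        sum(m[1] for m in members) / len(members)))
                else:
                    new_centers.append(centers[j])  # keep the old center if a cluster empties
            if new_centers == centers:
                break
            centers = new_centers
        labels = assign(points, centers)
        ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
        if ss < best_ss:
            best_labels, best_ss = labels, ss
    return best_labels


def comb2(n):
    return n * (n - 1) / 2


def adjusted_rand(a, b):
    """Adjusted Rand index between two labelings of the same objects."""
    sum_ij = sum(comb2(c) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb2(c) for c in Counter(a).values())
    sum_b = sum(comb2(c) for c in Counter(b).values())
    expected = sum_a * sum_b / comb2(len(a))
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def recovery(sizes, stretch, seed=0):
    """Generate one sample, cluster it, and score recovery of the true groups."""
    rng = random.Random(seed)
    points, truth = make_sample(sizes, stretch, rng)
    found = kmeans(points, len(sizes), rng)
    return adjusted_rand(truth, found)


if __name__ == "__main__":
    # Cross equal/unequal group sizes with increasing departure from sphericity.
    for sizes in ([100, 100], [170, 30]):
        for stretch in (1.0, 2.0, 4.0):
            print(sizes, stretch, round(recovery(sizes, stretch), 3))
```

A single sample per condition, as in the loop above, only hints at the pattern. A study of this kind would replicate each condition many times and analyze the resulting recovery values, for example with an analysis of variance, to test the main effects and their interaction.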
