An examination of procedures for determining the number of clusters in a data set

A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets that contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level; thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. The latter group of rules would therefore appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data-dependent.
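The design described above can be illustrated with a small sketch: generate artificial data with a known number of distinct clusters, build a hierarchy, score each hierarchy level with a stopping rule, and count how often the rule picks the true number. Everything specific here is an assumption made for illustration and is not taken from the study: the 2-D Gaussian data generator, the use of complete linkage as the single hierarchical method, the Calinski-Harabasz variance-ratio index as the one stopping rule (the abstract does not say which of the 30 rules did well), and all function names and parameter values.

```python
import random


def make_data(k, n_per=20, spread=0.5, gap=10.0, rng=random):
    """Assumed generator: k well-separated 2-D clusters, centres `gap` apart on a line."""
    return [(rng.gauss(gap * c, spread), rng.gauss(0.0, spread))
            for c in range(k) for _ in range(n_per)]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def complete_linkage(points, k_max):
    """Naive complete-linkage hierarchy; returns {k: partition into k clusters}, k <= k_max."""
    clusters = {i: [i] for i in range(len(points))}
    dist = {(i, j): sq_dist(points[i], points[j]) ** 0.5
            for i in clusters for j in clusters if i < j}
    levels = {}
    while len(clusters) > 1:
        if len(clusters) <= k_max:
            levels[len(clusters)] = [list(c) for c in clusters.values()]
        a, b = min(dist, key=dist.get)           # closest pair of clusters
        clusters[a].extend(clusters.pop(b))      # merge b into a
        del dist[(a, b)]
        for c in clusters:
            if c != a:
                key_a, key_b = tuple(sorted((a, c))), tuple(sorted((b, c)))
                # Complete linkage: distance to the merged cluster is the larger one.
                dist[key_a] = max(dist[key_a], dist.pop(key_b))
    return levels


def calinski_harabasz(points, clusters):
    """Variance-ratio index: (between SS / (k - 1)) / (within SS / (n - k))."""
    n, k = len(points), len(clusters)

    def centroid(idx):
        return (sum(points[i][0] for i in idx) / len(idx),
                sum(points[i][1] for i in idx) / len(idx))

    grand = centroid(range(n))
    within = between = 0.0
    for idx in clusters:
        c = centroid(idx)
        within += sum(sq_dist(points[i], c) for i in idx)
        between += len(idx) * sq_dist(c, grand)
    return (between / (k - 1)) / (within / (n - k))


def estimate_k(points, k_max=7):
    """Stopping rule: choose the hierarchy level with the largest index value."""
    levels = complete_linkage(points, k_max)
    return max(range(2, k_max + 1),
               key=lambda k: calinski_harabasz(points, levels[k]))


def monte_carlo(trials=5, seed=0):
    """Count, for each true k in 2..5, how many replicates recover the true k."""
    rng = random.Random(seed)
    return {k: sum(estimate_k(make_data(k, rng=rng)) == k for _ in range(trials))
            for k in (2, 3, 4, 5)}


if __name__ == "__main__":
    print(monte_carlo())
```

Swapping in a different index for `calinski_harabasz`, or a different linkage update in `complete_linkage`, gives a way to compare rules and methods on the same replicates, which is the kind of comparison the abstract reports at a much larger scale.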

[1]  R. L. Thorndike Who belongs in the family? , 1953 .

[2]  R. Sokal,et al.  Principles of numerical taxonomy , 1965 .

[3]  J. A. Gengerelli A method for detecting subgroups in a population and specifying their membership. , 1963, The Journal of psychology.

[4]  Geoffrey H. Ball,et al.  ISODATA, A NOVEL METHOD OF DATA ANALYSIS AND PATTERN CLASSIFICATION , 1965 .

[5]  A W EDWARDS,et al.  A METHOD FOR CLUSTER ANALYSIS. , 1965, Biometrics.

[6]  Joseph Naus,et al.  Power Comparison of Two Tests of Non-Random Clustering , 1966 .

[7]  R. Jancey Multidimensional group analysis , 1966 .

[8]  D. W. Goodall,et al.  Hypothesis-testing in Classification , 1966, Nature.

[9]  László Orlóci,et al.  An Agglomerative Method for Classification of Plant Communities , 1967 .

[10]  H. P. Friedman,et al.  On Some Invariant Criteria for Grouping Data , 1967 .

[11]  S. C. Johnson Hierarchical clustering schemes , 1967, Psychometrika.

[12]  A. Cohen,et al.  Estimation in Mixtures of Two Normal Distributions , 1967 .

[13]  J. Rubin Optimal classification into groups: an approach for solving the taxonomy problem. , 1967, Journal of theoretical biology.

[14]  J Zubin,et al.  ON THE METHODS AND THEORY OF CLUSTERING. , 1969, Multivariate behavioral research.

[15]  J. Hartigan,et al.  Percentage Points of a Test for Clusters , 1969 .

[16]  N. E. Day Estimating the components of a mixture of normal distributions , 1969 .

[17]  J. Wolfe PATTERN CLUSTERING BY MULTIVARIATE MIXTURE ANALYSIS. , 1970, Multivariate behavioral research.

[18]  Keinosuke Fukunaga,et al.  A Criterion and an Algorithm for Grouping Data , 1970, IEEE Transactions on Computers.

[19]  Joseph L. Fleiss,et al.  On the use of inverted factor analysis for generating typologies. , 1971 .

[20]  F. Marriott Practical problems in a method of cluster analysis. , 1971, Biometrics.

[21]  A. Scott,et al.  Clustering methods based on likelihood ratio criteria. , 1971 .

[22]  D. F. Andrews,et al.  PLOTS OF HIGH-DIMENSIONAL DATA , 1972 .

[23]  T. Frey,et al.  A Cluster Analysis of the D 2 Matrix of White Spruce Stands in Saskatchewan Based on the Maximum-Minimum Principle , 1972 .

[24]  D. A. Huffman,et al.  Development of New Pattern-Recognition Methods. , 1973 .

[25]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[26]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[27]  F. Rohlf Methods of Comparing Classifications , 1974 .

[28]  Brian Everitt,et al.  Cluster analysis , 1974 .

[29]  L. Hubert,et al.  Measuring the Power of Hierarchical Cluster Analysis , 1975 .

[30]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[31]  L. Hubert,et al.  A general statistical framework for assessing categorical clustering in free recall. , 1976 .

[32]  J. Hartigan Distribution Problems in Clustering , 1977 .

[33]  R. Mojena,et al.  Hierarchical Grouping Methods and Stopping Rules: An Evaluation , 1977, Comput. J..

[34]  P. Sneath A method for testing the distinctness of clusters: A test of the disjunction of two clusters in Euclidean space as measured by their overlap , 1977 .

[35]  Lawrence Hubert,et al.  The comparison and fitting of given classification schemes , 1977 .

[36]  Anil K. Jain,et al.  On the optimal number of features in the classification of multivariate Gaussian data , 1978, Pattern Recognit..

[37]  D. Binder Bayesian cluster analysis , 1978 .

[38]  J. Hartigan Asymptotic Distributions for Clustering Criteria , 1978 .

[39]  Anil K. Jain,et al.  Validity studies in clustering methodologies , 1979, Pattern Recognit..

[40]  Kerry L Lee,et al.  Multivariate Tests for Clusters , 1979 .

[41]  B. Everitt Unresolved Problems in Cluster Analysis , 1979 .

[42]  S. Arnold A Test for Clusters , 1979 .

[43]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  G. W. Milligan,et al.  An examination of the effect of six types of error perturbation on fifteen clustering algorithms , 1980 .

[45]  Leslie C. Morey,et al.  A Comparison of Four Clustering Methods Using MMPI Monte Carlo Data , 1980 .

[46]  Robert S. Hill,et al.  A Stopping Rule for Partitioning Dendrograms , 1980, Botanical Gazette.

[47]  G. W. Milligan,et al.  A Two-Stage Clustering Algorithm with Robust Recovery Characteristics , 1980 .

[48]  G. W. Milligan,et al.  A monte carlo study of thirty internal criterion measures for cluster analysis , 1981 .

[49]  B. Everitt A Monte Carlo Investigation Of The Likelihood Ratio Test For The Number Of Components In A Mixture Of Normal Distributions. , 1981, Multivariate behavioral research.

[50]  G. W. Milligan,et al.  A Review Of Monte Carlo Tests Of Cluster Analysis. , 1981, Multivariate behavioral research.

[51]  M. A. Wong,et al.  A Hybrid Clustering Method for Identifying High-Density Clusters , 1982 .

[52]  Irving John Good C129. An index of separateness of clusters and a permutation test for its statistical significance , 1982 .

[53]  G. W. Milligan,et al.  The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Warren S. Sarle,et al.  Cubic Clustering Criterion , 1983 .

[55]  Glenn W. Milligan,et al.  Characteristics of Four External Criterion Measures , 1983 .

[56]  Alan Agresti,et al.  The Measurement of Classification Agreement: An Adjustment to the Rand Statistic for Chance Agreement , 1984 .