An examination of indexes for determining the number of clusters in binary data sets

The problem of choosing the correct number of clusters is as old as cluster analysis itself. A number of authors have suggested various indexes to facilitate this crucial decision. One of the most extensive comparative studies of such indexes was conducted by Milligan and Cooper (1985). The present study pursues the same goal under different conditions. In contrast to Milligan and Cooper's work, the emphasis here is on high-dimensional empirical binary data. Artificial binary data sets are constructed to reflect features typically encountered in real-world data in the field of marketing research. The simulation comprises 162 binary data sets, each clustered by two different algorithms, and every index under consideration yields a recommended number of clusters for each resulting partition series. These recommendations are evaluated, and the performance of the indexes is compared and analyzed.
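The general procedure described above (generate artificial binary data, cluster it for several candidate numbers of clusters, and let an index recommend one) can be illustrated with a minimal sketch. It is not the study's actual design: the abstract names neither the two algorithms nor the indexes, so k-means with a deterministic maximin initialization and the Calinski-Harabasz index are assumed here purely for illustration, and the prototypes, noise level, sample size, and all function names are hypothetical.

```python
import numpy as np

def make_binary_data(prototypes, n_per_cluster, flip_prob, rng):
    """Copy each binary prototype n_per_cluster times, flipping each bit with flip_prob."""
    rows = []
    for proto in prototypes:
        flips = rng.random((n_per_cluster, proto.size)) < flip_prob
        rows.append(np.logical_xor(proto.astype(bool), flips))
    return np.vstack(rows).astype(float)

def sq_dists(X, centers):
    """Squared Euclidean distances (equal to Hamming distance for 0/1 rows)."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_iter=100):
    """Lloyd's k-means with maximin initialization (an assumed choice, for reproducibility)."""
    centers = X[[0]]
    while len(centers) < k:
        # Add the row that is farthest from all centers chosen so far.
        farthest = sq_dists(X, centers).min(axis=1).argmax()
        centers = np.vstack([centers, X[farthest]])
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * ((center - overall) ** 2).sum()
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

def recommend_k(X, candidates):
    """Cluster X for every candidate k and recommend the k with the largest index value."""
    scores = {k: calinski_harabasz(X, kmeans(X, k)) for k in candidates}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
# Three hypothetical 12-variable prototypes, each with its own block of four ones.
prototypes = np.kron(np.eye(3, dtype=int), np.ones(4, dtype=int))
X = make_binary_data(prototypes, n_per_cluster=100, flip_prob=0.05, rng=rng)
best_k, scores = recommend_k(X, range(2, 7))
print("recommended number of clusters:", best_k)
```

In a comparative study of this kind, the recommendation of each index would be checked against the number of clusters built into the artificial data, and the comparison repeated across data sets and algorithms.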

[1]  R. L. Thorndike Who belongs in the family? , 1953 .

[2]  J. A. Gengerelli A method for detecting subgroups in a population and specifying their membership. , 1963, The Journal of psychology.

[3]  Geoffrey H. Ball,et al.  ISODATA, A NOVEL METHOD OF DATA ANALYSIS AND PATTERN CLASSIFICATION , 1965 .

[4]  László Orlóci,et al.  An Agglomerative Method for Classification of Plant Communities , 1967 .

[5]  H. P. Friedman,et al.  On Some Invariant Criteria for Grouping Data , 1967 .

[6]  J. Hazel,et al.  BINARY (PRESENCE-ABSENCE) SIMILARITY COEFFICIENTS , 1969 .

[7]  J. Wolfe PATTERN CLUSTERING BY MULTIVARIATE MIXTURE ANALYSIS. , 1970, Multivariate behavioral research.

[8]  Keinosuke Fukunaga,et al.  A Criterion and an Algorithm for Grouping Data , 1970, IEEE Transactions on Computers.

[9]  R C Durfee,et al.  A METHOD OF CLUSTER ANALYSIS. , 1970, Multivariate behavioral research.

[10]  David R. Cox The analysis of binary data , 1970 .

[11]  F. Marriott Practical problems in a method of cluster analysis. , 1971, Biometrics.

[12]  A. Scott,et al.  Clustering methods based on likelihood ratio criteria. , 1971 .

[13]  D. F. Andrews,et al.  PLOTS OF HIGH-DIMENSIONAL DATA , 1972 .

[14]  D. A. Huffman,et al.  Development of New Pattern-Recognition Methods. , 1973 .

[15]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[16]  L. Hubert,et al.  Measuring the Power of Hierarchical Cluster Analysis , 1975 .

[17]  Cesare Baroni-Urbani,et al.  Similarity of Binary Data , 1976 .

[18]  L. Hubert,et al.  A general statistical framework for assessing categorical clustering in free recall. , 1976 .

[19]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[20]  Bruce L. Stern,et al.  Research for Marketing Decisions , 1978 .

[21]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  G. W. Milligan,et al.  An examination of the effect of six types of error perturbation on fifteen clustering algorithms , 1980 .

[23]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[24]  G. W. Milligan,et al.  A monte carlo study of thirty internal criterion measures for cluster analysis , 1981 .

[25]  Z. Hubálek COEFFICIENTS OF ASSOCIATION AND SIMILARITY, BASED ON BINARY (PRESENCE‐ABSENCE) DATA: AN EVALUATION , 1982 .

[26]  Warren S. Sarle,et al.  Cubic Clustering Criterion , 1983 .

[27]  L. Fahrmeir,et al.  Multivariate statistische Verfahren , 1984 .

[28]  M. Aldenderfer Cluster Analysis , 1984 .

[29]  M. Aldenderfer,et al.  Cluster Analysis. Sage University Paper Series On Quantitative Applications in the Social Sciences 07-044 , 1984 .

[30]  G. W. Milligan,et al.  An examination of procedures for determining the number of clusters in a data set , 1985 .

[31]  John C. Gower,et al.  Measures of Similarity, Dissimilarity and Distance , 1985 .

[32]  A. McCutcheon,et al.  Latent Class Analysis , 2021, Encyclopedia of Autism Spectrum Disorders.

[33]  F. B. Baulieu A classification of presence/absence based dissimilarity coefficients , 1989 .

[34]  Xiaobo Li,et al.  A probabilistic measure of similarity for binary data in pattern recognition , 1989, Pattern Recognit..

[35]  Miin-Shen Yang,et al.  ON STOCHASTIC CONVERGENCE THEOREMS FOR THE FUZZY C-MEANS CLUSTERING PROCEDURE , 1990 .

[36]  Eric S. Lander,et al.  The distribution of clusters in random graphs , 1990 .

[37]  Erkki Oja,et al.  Rival penalized competitive learning for clustering analysis, RBF net, and curve detection , 1993, IEEE Trans. Neural Networks.

[38]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[39]  Phipps Arabie,et al.  AN OVERVIEW OF COMBINATORIAL DATA ANALYSIS , 1996 .

[40]  Rabikar Chatterjee,et al.  Joint Segmentation on Distinct Interdependent Bases with Categorical Data , 1996 .

[41]  G. De Soete,et al.  Clustering and Classification , 2019, Data-Driven Science and Engineering.

[42]  M. Wedel,et al.  Market Segmentation: Conceptual and Methodological Foundations , 1997 .

[43]  Lei Xu,et al.  Bayesian Ying-Yang machine, clustering and number of clusters , 1997, Pattern Recognit. Lett..

[44]  Christian Buchta,et al.  A comparison of several cluster algorithms on artificial binary data [Part 1]. Scenarios from travel market segmentation [Part 2: Working Paper 19]. , 1998 .

[45]  S. Dolnicar,et al.  A Tale of Three Cities: Perceptual Charting for Analyzing Destination Images , 1998 .