NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set

Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than to objects in different groups. Most clustering algorithms depend on assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation of its validity. The evaluation procedure has to tackle difficult problems such as the quality of clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, programs are unavailable to test these indices and compare them. The R package NbClust has been developed for that purpose. It provides 30 indices for determining the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determine the most appropriate number of clusters for the data set of interest.
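The idea of evaluating several clustering schemes while varying the number of clusters can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce NbClust's interface. It assumes a toy two-dimensional data set, a plain k-means with a deterministic farthest-point initialisation, and a single validity index (Calinski-Harabasz), whereas the package provides 30 indices. The helper names `kmeans` and `calinski_harabasz` are hypothetical.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's k-means with a deterministic farthest-point start (illustrative only)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so centers already match the labels
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

def calinski_harabasz(points, labels, centers):
    """Ratio of between-cluster to within-cluster dispersion; larger is better."""
    n, k = len(points), len(centers)
    overall = tuple(sum(x) / n for x in zip(*points))
    within = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    between = sum(labels.count(j) * math.dist(centers[j], overall) ** 2 for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

# Assumed toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

# Score each candidate number of clusters and keep the best-scoring one.
scores = {}
for k in range(2, 6):
    labels, centers = kmeans(points, k)
    scores[k] = calinski_harabasz(points, labels, centers)

best_k = max(scores, key=scores.get)
print(scores)
print("suggested number of clusters:", best_k)
```

On this toy data the index peaks at three clusters. A single index can be misleading on less clear-cut data, which is one motivation for comparing many indices and clustering methods side by side.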
