Investigating cluster validation metrics for optimal number of clusters determination

Clustering partitions a given set of objects into groups of highly related instances. Relatedness is measured with a chosen distance metric, which is used to estimate intra-cluster similarity. Finding the optimal number of such partitions is usually the key step in the whole process, and a difficult one, not least because the term “optimal” is itself ambiguous. Selecting an unsuitable number of clusters can lead to incorrect conclusions and, consequently, to wrong decisions. Inherent characteristics of a dataset, such as overlapping clusters or clusters that contain subclusters, often make the task harder still. The methods used to detect similarities and the parameter selection of the partitioning algorithm therefore have a major impact on the quality of the groups and on the identification of their optimal number. Because each dataset is a rather distinct case, validity indices were introduced as indicators for selecting this optimal number of clusters. In this work, an extensive set of well-known validity indices based on the so-called relative criteria approach is examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both the real-world and the generated data were selected to exhibit variation and inhomogeneity. Each index is applied under 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
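The relative-criteria idea described above can be illustrated with a minimal sketch: run one clustering algorithm for several candidate values of k, score each partition with a validity index, and keep the k with the best score. Everything in the block is an assumption made for illustration and not the study's actual pipeline: the index is the silhouette width, the clustering method is a plain k-means with a deterministic farthest-first initialisation, the distance is Euclidean, and the toy data and the names `kmeans`, `silhouette`, and `blob` are hypothetical.

```python
import math

def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def blob(cx, cy):
    """Toy data: a 3x3 grid of points around (cx, cy)."""
    return [(cx + dx, cy + dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)]

points = blob(0, 0) + blob(10, 0) + blob(5, 9)

# Relative criterion: compare the index across candidate k, keep the maximum.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("selected k:", best_k)
```

On such well-separated toy data the choice is easy. With overlapping clusters or nested subclusters, different indices, clustering methods, and distance metrics can disagree, which is the situation a comparative study of this kind is meant to probe.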
