Clustering stability and ground truth: Numerical experiments

Stability has been considered an important property for evaluating clustering solutions. Nevertheless, there are no conclusive studies on the relationship between this property and the capacity to recover clusters inherent to data (“ground truth”). This study focuses on this relationship resorting to synthetic data generated under diverse scenarios (controlling relevant factors). Stability is evaluated using a weighted cross-validation procedure. Indices of agreement (corrected for agreement by chance) are used both to assess stability and external validation. The results obtained reveal a new perspective so far not mentioned in the literature. Despite the clear relationship between stability and external validity when a broad range of scenarios is considered, within-scenarios conclusions deserve our special attention: faced with a specific clustering problem (as we do in practice), there is no significant relationship between stability and the ability to recover data clusters.

[1]  Ricardo J. G. B. Campello,et al.  Relative clustering validity criteria: A comparative overview , 2010 .

[2]  Boris G. Mirkin,et al.  Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads , 2010, J. Classif..

[3]  André C. P. L. F. de Carvalho,et al.  Evaluation of Clustering Results: The Trade-off Bias-Variability , 2010 .

[4]  Gérard Govaert,et al.  Rmixmod: The R Package of the Model-Based Unsupervised, Supervised and Semi-Supervised Classification Mixmod Library , 2015 .

[5]  Ulrike von Luxburg,et al.  How the initialization affects the stability of the $k$-means algorithm , 2009, 0907.5494.

[6]  Vincent Kanade,et al.  Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[7]  R. Blashfield,et al.  A Nearest-Centroid Technique for Evaluating the Minimum-Variance Clustering Procedure. , 1980 .

[8]  L. Hubert,et al.  Comparing partitions , 1985 .

[9]  Ranjan Maitra,et al.  Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms , 2010 .

[10]  Ulrike von Luxburg,et al.  Clustering Stability: An Overview , 2010, Found. Trends Mach. Learn..

[11]  Joachim M. Buhmann,et al.  Stability-Based Validation of Clustering Solutions , 2004, Neural Computation.

[12]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[13]  Christian Hennig,et al.  Cluster-wise assessment of cluster stability , 2007, Comput. Stat. Data Anal..

[14]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[15]  Gérard Govaert,et al.  Model-based cluster and discriminant analysis with the MIXMOD software , 2006, Comput. Stat. Data Anal..

[16]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[17]  Yasuichi Horibe,et al.  Entropy and correlation , 1985, IEEE Transactions on Systems, Man, and Cybernetics.

[18]  Robert Henson,et al.  OCLUS: An Analytic Method for Generating Clusters with Known Overlap , 2005, J. Classif..

[19]  Claus Weihs,et al.  Classification as a Tool for Research , 2010 .

[20]  Alexander Kraskov,et al.  Published under the scientific responsability of the EUROPEAN PHYSICAL SOCIETY Incorporating , 2002 .

[21]  M. Warrens On Similarity Coefficients for 2×2 Tables and Correction for Chance , 2008, Psychometrika.

[22]  James Bailey,et al.  Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance , 2010, J. Mach. Learn. Res..

[23]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[24]  Shai Ben-David,et al.  Relating Clustering Stability to Properties of Cluster Boundaries , 2008, COLT.