The Impact of Random Models on Clustering Similarity

Clustering is a central approach to unsupervised learning. Once a clustering has been obtained, the most fundamental follow-up analysis is to quantitatively compare clusterings. Such comparisons are crucial for evaluating clustering methods, as well as for other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, clustering similarity should be assessed in the context of a random ensemble of clusterings. The prevailing assumption for the random clustering ensemble is the permutation model, in which the number and sizes of clusters are fixed. However, this assumption does not necessarily hold in practice; for example, multiple runs of K-means clustering return clusterings with a fixed number of clusters, while the cluster size distributions vary greatly. Here, we derive corrected variants of two clustering similarity measures (the Rand index and mutual information) in the context of two random clustering ensembles in which the number and sizes of clusters vary. In addition, we study the impact of one-sided comparisons in the scenario with a reference clustering. The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs and on the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified.
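As a point of reference for the correction the abstract discusses, the following is a minimal sketch (not the paper's own code) of the standard chance correction under the permutation model: the Rand index, and its adjusted variant whose expected value is computed with the number and sizes of clusters held fixed. The variable-size ensembles studied in the paper replace only the expectation term in this construction.

```python
from collections import Counter
from math import comb

def rand_index(a, b):
    """Fraction of object pairs on which clusterings a and b agree
    (both place the pair together, or both place it apart)."""
    n = len(a)
    agree = sum(
        (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return agree / comb(n, 2)

def adjusted_rand_index(a, b):
    """Rand index corrected for chance under the permutation model,
    which fixes the number and sizes of clusters in both clusterings."""
    n = len(a)
    # contingency counts n_ij and the marginal cluster sizes
    nij = Counter(zip(a, b))
    ai, bj = Counter(a), Counter(b)
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in ai.values())
    sum_b = sum(comb(v, 2) for v in bj.values())
    # expected pair-count overlap when cluster sizes are fixed
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

A clustering similar to the reference scores near 1, while a random relabeling with the same cluster sizes scores near 0; under the variable-size random models of the paper, the `expected` term would instead be averaged over ensembles in which cluster number and sizes also fluctuate.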