Systematic Analysis of Cluster Similarity Indices: How to Validate Validation Measures

Many cluster similarity indices are used to evaluate clustering algorithms, and choosing the best one for a particular task usually remains an open problem. In this paper, we perform a thorough analysis of this problem: we develop a list of desirable properties (requirements) and theoretically verify which indices satisfy them. In particular, we investigate dozens of pair-counting indices and prove that none of them meets all the requirements. Based on our analysis, we propose using the arccosine of the correlation coefficient as a similarity measure and show that it satisfies all the requirements but one, which is still satisfied asymptotically. We illustrate the practical importance of our analysis via an online experiment within a major news aggregator system.
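As a rough illustration of the proposed measure, the sketch below computes the pair-counting correlation coefficient between two clusterings and takes its arccosine. It assumes the correlation coefficient is the Pearson correlation between the two binary "same cluster" indicators over all element pairs. The function names, the label-list representation, and the choice to return the angle unnormalized (in radians) are assumptions made for this example and are not taken from the abstract.

```python
import math
from itertools import combinations


def pair_counts(a, b):
    """Count element pairs by co-membership in clusterings a and b.

    a and b are equal-length sequences of cluster labels.
    Returns (n11, n10, n01, n00): pairs together in both, together only
    in a, together only in b, and separated in both.
    """
    if len(a) != len(b):
        raise ValueError("clusterings must label the same elements")
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def correlation_coefficient(a, b):
    """Pearson correlation of the binary pair-indicator vectors of a and b."""
    n11, n10, n01, n00 = pair_counts(a, b)
    denom = math.sqrt((n11 + n10) * (n11 + n01) * (n00 + n10) * (n00 + n01))
    if denom == 0:
        # Happens when one clustering puts all pairs together or all apart.
        raise ValueError("correlation is undefined for a constant pair vector")
    return (n11 * n00 - n10 * n01) / denom


def arccos_correlation(a, b):
    """Arccosine of the correlation coefficient, in radians (0 = identical)."""
    cc = correlation_coefficient(a, b)
    cc = max(-1.0, min(1.0, cc))  # guard against floating-point drift
    return math.acos(cc)


if __name__ == "__main__":
    print(arccos_correlation([0, 0, 1, 1], [0, 0, 1, 1]))
    print(arccos_correlation([0, 0, 1, 1], [0, 1, 0, 1]))
```

The pair loop is quadratic in the number of elements, which is acceptable for a small example. A contingency-table formulation would be the usual choice for larger inputs.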
