Characterization and evaluation of similarity measures for pairs of clusterings

In evaluating the results of cluster analysis, it is common practice to rely on a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering obtained from human informants. Little research has addressed techniques for expressing the similarity between clusterings, which leaves broad scope for fundamental work in this area. In defining the comparative problem, we identify two types of worst-case match between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. We discuss the desirable behaviour of a similarity measure in each of the two worst cases, which gives rise to five test scenarios in which characteristics of one of a pair of clusterings were manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously proposed clustering similarity measures, as well as for a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. We also propose a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
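
As an illustration of the kind of measure compared here, the following is a minimal sketch of one form of Normalised Mutual Information between two clusterings of the same items. It is not taken from the paper: the function names, the choice of the geometric mean of the two entropies as the normaliser, and the convention for single-cluster inputs are all assumptions made for this example. It shows the two extremes mentioned above: a pair of clusterings that agree up to relabelling, and a pair that is statistically independent.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a clustering given as a label list."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two clusterings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))      # contingency table as a sparse dict
    size_a, size_b = Counter(a), Counter(b)
    return sum(
        (n_ij / n) * log(n_ij * n / (size_a[i] * size_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI normalised by the geometric mean of the entropies (an assumed variant)."""
    if len(a) != len(b):
        raise ValueError("clusterings must label the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Assumed convention: two single-cluster clusterings count as identical.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


same_up_to_relabelling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent_pair = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Other normalisers (for example the maximum, minimum or arithmetic mean of the two entropies) yield the other forms of Normalised Mutual Information; the geometric mean is used here only to keep the sketch short.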
