Comparing Clusterings - An Overview

As the amount of data we have to deal with grows ever larger, methods that help us detect structure in the data and identify interesting subsets become more and more important. One of these methods is clustering, i.e. segmenting a set of elements into subsets such that the elements in each subset are somehow "similar" to each other and elements of different subsets are "dissimilar". The literature offers a large variety of clustering algorithms, each having certain advantages but also certain drawbacks. Typical questions that arise in this context include:
