Relative clustering validity criteria: A comparative overview

Many relative clustering validity criteria exist that are useful in practice as quantitative measures of the quality of data partitions, and new criteria continue to be proposed. Each criterion has particular features that may allow it to outperform the others in specific classes of problems, and the criteria may also have very different computational requirements. Faced with such a variety of possibilities, the user finds it hard to choose a specific criterion. For this reason, a relevant issue in cluster analysis is comparing the performance of existing validity criteria and, possibly, that of a newly proposed criterion. However, the comparison paradigm traditionally adopted in the literature is subject to some conceptual limitations. The present paper describes an alternative, possibly complementary methodology for comparing clustering validity criteria and uses it to make an extensive comparison of the performance of 40 criteria over a collection of 962,928 partitions derived from five well-known clustering algorithms and 1080 different data sets of a given class of interest. A detailed review of the relative criteria under investigation is also provided, including an original comparative asymptotic analysis of their computational complexities. This work is intended to complement the classic study reported in 1985 by Milligan and Cooper and to thoroughly extend a preliminary paper by the authors themselves. © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3:
