On strategies for building effective ensembles of relative clustering validity criteria

Evaluation and validation are essential for obtaining meaningful clustering results. Relative validity criteria are commonly used in practice to select and validate clustering solutions, because they allow single partitions to be evaluated, and pairs of partitions to be compared in relative terms, based only on the data under analysis. The clustering literature describes a plethora of relative validity measures, which makes it difficult to choose an appropriate one for a given application. One reason for this variety is that no single measure can capture all aspects of the clustering problem, so each is prone to fail in particular application scenarios. In the present work, we take advantage of this diversity. Previous work showed that when relative validity criteria are selected at random for an ensemble (from an initial set of 28 measures), one can expect with great certainty only to improve on the worst criterion included in the ensemble. In this paper, we propose a method for selecting, from the same set of 28 measures, criteria that meet a minimum level of effectiveness and exhibit some degree of complementarity. The resulting ensembles show superior performance compared to any single ensemble member (and not just the worst one) over a variety of datasets. One can also expect greater stability of the evaluation across datasets, even when different ensemble strategies are considered. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
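To make the idea of selecting effective and complementary criteria more concrete, the following is a minimal sketch and not the procedure proposed in the paper. It assumes that each criterion has already scored the same candidate partitions, that all scores are oriented so that higher is better, and that a per-criterion effectiveness value is available from some earlier benchmark. The greedy filter, the thresholds `min_eff` and `max_corr`, the use of Spearman correlation as a stand-in for (lack of) complementarity, the mean-rank aggregation, and all names and numbers below are hypothetical choices made for illustration.

```python
from statistics import mean


def ranks(scores):
    """Rank candidates from 1 (highest score) upward; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def select_members(effectiveness, criterion_scores, min_eff=0.5, max_corr=0.9):
    """Hypothetical greedy filter: visit criteria from most to least effective,
    skip those below min_eff, and keep a criterion only if its ranking is not
    too strongly correlated with any criterion already kept."""
    chosen = []
    for name in sorted(effectiveness, key=effectiveness.get, reverse=True):
        if effectiveness[name] < min_eff:
            continue
        if all(spearman(criterion_scores[name], criterion_scores[c]) < max_corr
               for c in chosen):
            chosen.append(name)
    return chosen


def ensemble_ranking(criterion_scores, members):
    """Aggregate the members by mean rank per candidate (lower is better)."""
    per_member = [ranks(criterion_scores[m]) for m in members]
    return [mean(col) for col in zip(*per_member)]


# Made-up scores of four criteria on four candidate partitions.
criterion_scores = {
    "A": [0.9, 0.5, 0.3, 0.1],
    "B": [0.8, 0.6, 0.2, 0.1],  # same ordering as A, so it adds nothing new
    "C": [0.4, 0.9, 0.1, 0.3],
    "D": [0.2, 0.1, 0.9, 0.8],
}
# Made-up effectiveness values (e.g. from an earlier benchmark).
effectiveness = {"A": 0.80, "B": 0.75, "C": 0.70, "D": 0.30}

members = select_members(effectiveness, criterion_scores)
combined = ensemble_ranking(criterion_scores, members)
print(members, combined)
```

In this toy setting, criterion B is dropped because it orders the partitions exactly as A does, and D is dropped for falling below the effectiveness threshold, so the ensemble combines only A and C.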
