Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach

MOTIVATION: Biologists often employ clustering techniques in the exploratory phase of microarray data analysis to discover relevant biological groupings. Given the many clustering algorithms available in the machine-learning literature, a user may want to select the one that performs best for a given data set or application. Various validation measures have been proposed over the years to judge the quality of the clusters produced by a clustering algorithm, including their biological relevance. Unfortunately, an algorithm can perform poorly under one validation measure while outperforming many other algorithms under another. Manually synthesizing results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed.

RESULTS: Using a Monte Carlo cross-entropy algorithm, we combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, ranking a total of 11 clustering algorithms by a combined examination of 10 different validation measures. Aggregate rankings were obtained both for a given number of clusters k and for an entire range of k.

AVAILABILITY: R code for all validation measures and rank aggregation is available from the authors upon request.

SUPPLEMENTARY INFORMATION: Supplementary information is available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.
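
The idea of weighted rank aggregation by cross-entropy Monte Carlo can be illustrated with a minimal Python sketch. It is not the authors' R code. Several choices here are assumptions made only for illustration: the distance criterion is taken to be a weighted Spearman footrule distance (the abstract says only "a distance criterion"), candidate orderings are sampled position by position from a matrix of position-by-item probabilities, and the function names, algorithm labels, weights, and tuning parameters (`n_samples`, `elite_frac`, `alpha`, `n_iter`) are all hypothetical.

```python
import random


def footrule(candidate, ranked_list):
    """Spearman footrule distance between two orderings of the same items."""
    pos = {item: i for i, item in enumerate(ranked_list)}
    return sum(abs(i - pos[item]) for i, item in enumerate(candidate))


def objective(candidate, lists, weights):
    """Weighted sum of distances from the candidate to every input ranking."""
    return sum(w * footrule(candidate, lst) for lst, w in zip(lists, weights))


def sample_ordering(P, items, rng):
    """Fill positions in turn, drawing an unused item with probability
    proportional to P[position][item]."""
    remaining = list(range(len(items)))
    order = []
    for position in range(len(items)):
        probs = [P[position][j] for j in remaining]
        j = rng.choices(remaining, weights=probs, k=1)[0]
        remaining.remove(j)
        order.append(items[j])
    return order


def ce_rank_aggregate(lists, weights, n_samples=200, elite_frac=0.1,
                      alpha=0.7, n_iter=30, seed=0):
    """Cross-entropy search for an ordering with a small weighted distance."""
    rng = random.Random(seed)
    items = sorted(lists[0])
    n = len(items)
    index = {item: j for j, item in enumerate(items)}
    # Start from a uniform distribution over items at every position.
    P = [[1.0 / n] * n for _ in range(n)]
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        samples = [sample_ordering(P, items, rng) for _ in range(n_samples)]
        samples.sort(key=lambda s: objective(s, lists, weights))
        elite = samples[:n_elite]
        top_score = objective(elite[0], lists, weights)
        if top_score < best_score:
            best, best_score = elite[0], top_score
        # Move P toward the item frequencies seen in the elite samples;
        # smoothing with alpha < 1 keeps every probability positive.
        for position in range(n):
            counts = [0.0] * n
            for s in elite:
                counts[index[s[position]]] += 1.0
            for j in range(n):
                P[position][j] = (alpha * counts[j] / n_elite
                                  + (1.0 - alpha) * P[position][j])
    return best, best_score


# Hypothetical input: three validation measures, each ranking four algorithms.
lists = [
    ["hierarchical", "kmeans", "pam", "som"],
    ["kmeans", "hierarchical", "pam", "som"],
    ["hierarchical", "pam", "kmeans", "som"],
]
weights = [2.0, 1.0, 1.0]  # illustrative importance of each measure
best, score = ce_rank_aggregate(lists, weights)
print(best, score)
```

With only four items an exhaustive search over all 24 orderings would be simpler; a stochastic search of this kind becomes useful when the number of algorithms makes enumeration impractical, as with the 11 algorithms compared in the study.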
