Computational cluster validation in post-genomic data analysis

MOTIVATION: The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is considered less often. Suitable computational cluster validation techniques are available in the general data-mining literature, but have received only a fraction of that attention in bioinformatics.

RESULTS: This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.

AVAILABILITY: The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.

SUPPLEMENTARY INFORMATION: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.
