Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data

In this paper we present a new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data. The method can best be thought of as an analysis approach, to guide and assist in the use of any of a wide range of available clustering algorithms. We call the new methodology consensus clustering, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. Finally, it provides for a visualization tool to inspect cluster number, membership, and boundaries. We present the results of our experiments on both simulated data and real gene expression data aimed at evaluating the effectiveness of the methodology in discovering biologically meaningful clusters.

[1]  H. Alker,et al.  On measuring inequality. , 1964, Behavioral science.

[2]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[3]  J. Hartigan Asymptotic Distributions for Clustering Criteria , 1978 .

[4]  H. Bock On some significance tests in cluster analysis , 1985 .

[5]  G. W. Milligan,et al.  A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. , 1986, Multivariate behavioral research.

[6]  Anil K. Jain,et al.  Bootstrap technique in cluster analysis , 1987, Pattern Recognit..

[7]  A. F. Smith,et al.  Statistical analysis of finite mixture distributions , 1986 .

[8]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[9]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[10]  R. Tibshirani,et al.  An introduction to the bootstrap , 1993 .

[11]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[12]  S. T. Buckland,et al.  An Introduction to the Bootstrap. , 1994 .

[13]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[14]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[15]  L. Wasserman,et al.  Computing Bayes Factors by Combining Simulation and Asymptotic Approximations , 1997 .

[16]  Teuvo Kohonen,et al.  The self-organizing map , 1990, Neurocomputing.

[17]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[18]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[19]  J. Mesirov,et al.  Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[20]  Jill P. Mesirov,et al.  Class prediction and discovery using gene expression data , 2000, RECOMB '00.

[21]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[22]  E. Lander,et al.  Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[24]  Ka Yee Yeung,et al.  Validating clustering for gene expression data , 2001, Bioinform..

[25]  E. Lander,et al.  Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[26]  Adrian E. Raftery,et al.  Model-based clustering and data transformations for gene expression data , 2001, Bioinform..

[27]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[28]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[29]  Eytan Domany,et al.  Resampling Method for Unsupervised Estimation of Cluster Validity , 2001, Neural Computation.

[30]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[31]  A. Orth,et al.  Large-scale analysis of the human and mouse transcriptomes , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[32]  Isabelle Guyon,et al.  A Stability Based Method for Discovering Structure in Clustered Data , 2001, Pacific Symposium on Biocomputing.

[33]  J. Downing,et al.  Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. , 2002, Cancer cell.

[34]  S. Dudoit,et al.  A prediction-based resampling method for estimating the number of clusters in a dataset , 2002, Genome Biology.

[35]  Paola Sebastiani,et al.  Cluster analysis of gene expression dynamics , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Matthew West,et al.  Bayesian factor regression models in the''large p , 2003 .

[37]  Erik D. Demaine,et al.  K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data , 2002, WABI.

[38]  J. Downing,et al.  Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. , 2003, Blood.

[39]  David Maxwell Chickering,et al.  Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables , 1997, Machine Learning.

[40]  Robert Tibshirani,et al.  Cluster Validation by Prediction Strength , 2005 .

[41]  K. Roeder,et al.  Journal of the American Statistical Association: Comment , 2006 .