Cluster stability scores for microarray data in cancer studies

BackgroundA potential benefit of profiling of tissue samples using microarrays is the generation of molecular fingerprints that will define subtypes of disease. Hierarchical clustering has been the primary analytical tool used to define disease subtypes from microarray experiments in cancer settings. Assessing cluster reliability poses a major complication in analyzing output from clustering procedures. While most work has focused on estimating the number of clusters in a dataset, the question of stability of individual-level clusters has not been addressed.ResultsWe address this problem by developing cluster stability scores using subsampling techniques. These scores exploit the redundancy in biologically discriminatory information on the chip. Our approach is generic and can be used with any clustering method. We propose procedures for calculating cluster stability scores for situations involving both known and unknown numbers of clusters. We also develop cluster-size adjusted stability scores. The method is illustrated by application to data three cancer studies; one involving childhood cancers, the second involving B-cell lymphoma, and the final is from a malignant melanoma study.AvailabilityCode implementing the proposed analytic method can be obtained at the second author's website.

[1]  Isabelle Guyon,et al.  A Stability Based Method for Discovering Structure in Clustered Data , 2001, Pacific Symposium on Biocomputing.

[2]  Richard M. Simon,et al.  Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data , 2002, Bioinform..

[3]  N. Sampas,et al.  Molecular classification of cutaneous malignant melanoma by gene expression profiling , 2000, Nature.

[4]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[5]  M K Kerr,et al.  Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[6]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[7]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Ka Yee Yeung,et al.  Validating clustering for gene expression data , 2001, Bioinform..

[9]  Debashis Ghosh,et al.  STATISTICAL ISSUES IN THE CLUSTERING OF GENE EXPRESSION DATA , 2001 .

[10]  S. Dudoit,et al.  A prediction-based resampling method for estimating the number of clusters in a dataset , 2002, Genome Biology.

[11]  W. Krzanowski,et al.  A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering , 1988 .

[12]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[13]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[14]  Ron Shamir,et al.  Clustering Gene Expression Patterns , 1999, J. Comput. Biol..

[15]  Robert Tibshirani,et al.  EXPLORATORY SCREENING OF GENES AND CLUSTERS FROM MICROARRAY EXPERIMENTS , 2002 .

[16]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[17]  R. Tibshirani,et al.  Combining Estimates in Regression and Classification , 1996 .

[18]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[19]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[20]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.