MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering

Background
Uncovering disease subtypes from microarray samples has important clinical implications, for example for survival time and for the sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of high-dimensional microarray clusters, which limits their performance.

Results
We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs, varying the number of clusters across runs, and identifies the clusters whose elements show the most robust co-memberships. In addition to the original algorithm, we devised an entropy plot to control the separation of singletons or small clusters. Unlike simple k-means or other widely used methods, MULTI-K was able to capture clusters with complex, high-dimensional structures accurately. MULTI-K outperformed other methods, including a recently developed ensemble clustering algorithm, in tests with five simulated and eight real gene-expression data sets.

Conclusion
The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors.
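The core idea of combining k-means runs over a range of cluster numbers can be illustrated with a minimal Python sketch. This is not the authors' C++ implementation of MULTI-K: the function names (`kmeans`, `co_membership`, `robust_clusters`), the toy data, and the fixed threshold are all assumptions made for illustration, and the final step uses simple thresholded connected components as a stand-in for the method's own cluster extraction. The entropy plot is not reproduced here.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on tuples; returns one cluster label per point."""
    centers = rng.sample(points, k)  # k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def co_membership(points, k_values, runs_per_k=5, seed=0):
    """Fraction of k-means runs (over all k in k_values) in which each pair
    of points lands in the same cluster. The diagonal is left at 0."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in k_values:
        for _ in range(runs_per_k):
            labels = kmeans(points, k, rng)
            total += 1
            for i, j in combinations(range(n), 2):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
                    counts[j][i] += 1
    return [[c / total for c in row] for row in counts]


def robust_clusters(weights, threshold):
    """Stand-in extraction step (an assumption, not the paper's procedure):
    connected components of the graph that keeps only pairs whose
    co-membership weight is at least `threshold`."""
    n = len(weights)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if i != j and labels[j] == -1 and weights[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy data (hypothetical): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
weights = co_membership(points, k_values=[2, 3, 4])
labels = robust_clusters(weights, threshold=0.3)
print(labels)
```

In this sketch, pairs of points that stay together as k increases accumulate high weights, while pairs that are separated at most values of k do not. The threshold then plays a role loosely analogous to choosing how robust a co-membership must be before two samples are placed in the same subtype.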
