SC³: Triple Spectral Clustering-Based Consensus Clustering Framework for Class Discovery from Cancer Gene Expression Profiles

Successful diagnosis and treatment of cancer depend on discovering and classifying cancer types correctly. One of the challenging properties of class discovery from cancer data sets is that cancer gene expression profiles not only include a large number of genes, but also contain many noisy genes. To reduce the effect of noisy genes in cancer gene expression profiles, we propose two new consensus clustering frameworks for cancer discovery from gene expression profiles: triple spectral clustering-based consensus clustering (SC³) and double spectral clustering-based consensus clustering (SC²Ncut). SC³ integrates the spectral clustering (SC) algorithm into the ensemble framework three times to process gene expression profiles. Specifically, spectral clustering is applied to cluster the gene dimension and the cancer sample dimension, and it is also used as the consensus function that partitions the consensus matrix constructed from multiple clustering solutions. In contrast to SC³, SC²Ncut adopts the normalized cut algorithm, instead of spectral clustering, as the consensus function. Experiments on both synthetic data sets and real cancer gene expression profiles illustrate that the proposed approaches not only achieve good performance on gene expression profiles, but also outperform most of the existing approaches in class discovery from these profiles.
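
As a minimal illustration of the consensus step, the sketch below builds a consensus matrix from several clustering solutions of the same samples. The abstract does not define the matrix, so the sketch assumes the common co-association form: entry (i, j) is the fraction of solutions in which samples i and j share a cluster. The function name `consensus_matrix` and the toy label lists are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def consensus_matrix(solutions):
    """Return the co-association matrix of several clustering solutions.

    `solutions` is a list of label lists, one per clustering solution,
    all covering the same samples in the same order. Entry [i][j] is the
    fraction of solutions that put samples i and j in the same cluster.
    This definition is an assumption, not the paper's stated formula.
    """
    if not solutions:
        raise ValueError("need at least one clustering solution")
    n = len(solutions[0])
    if any(len(labels) != n for labels in solutions):
        raise ValueError("all solutions must label the same samples")

    counts = [[0] * n for _ in range(n)]
    for labels in solutions:
        # A sample always co-occurs with itself.
        for i in range(n):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it to keep symmetry.
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1

    total = len(solutions)
    return [[count / total for count in row] for row in counts]


# Hypothetical toy input: three solutions over four samples.
solutions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first, with labels swapped
    [0, 0, 0, 1],
]
matrix = consensus_matrix(solutions)
for row in matrix:
    print([round(value, 2) for value in row])
```

Because only co-membership is counted, the matrix is unaffected by how each solution happens to number its clusters. In the frameworks described above, a matrix of this kind would be treated as a similarity matrix and partitioned by the consensus function: spectral clustering in SC³, normalized cut in SC²Ncut.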
