Double Selection Based Semi-Supervised Clustering Ensemble for Tumor Clustering from Gene Expression Profiles

Tumor clustering is an important technique for tumor discovery from cancer gene expression profiles and is useful for the diagnosis and treatment of cancer. Although many algorithms have been proposed for tumor clustering, few make use of expert knowledge to improve the performance of tumor discovery. In this paper, we first treat expert knowledge as constraints in the clustering process and propose a feature selection based semi-supervised cluster ensemble framework (FS-SSCE) for tumor clustering from bio-molecular data. Compared with traditional tumor clustering approaches, FS-SSCE has two distinguishing properties: (1) it adopts feature selection techniques to dispel the effect of noisy genes, and (2) it employs the binate constraint based K-means algorithm to take the experts' knowledge into account. We then propose a double selection based semi-supervised cluster ensemble framework (DS-SSCE), which not only applies feature selection to perform gene selection on the gene dimension, but also selects an optimal subset of representative clustering solutions in the ensemble and improves the performance of tumor clustering using the normalized cut algorithm. DS-SSCE also introduces a confidence factor into the construction of the consensus matrix by considering prior knowledge of the data set. Finally, we design a modified double selection based semi-supervised cluster ensemble framework (MDS-SSCE), which adopts multiple clustering solution selection strategies and an aggregated solution selection function to choose an optimal subset of clustering solutions. Experiments on cancer gene expression profiles show that (i) FS-SSCE, DS-SSCE and MDS-SSCE are suitable for tumor clustering from bio-molecular data, and (ii) MDS-SSCE outperforms a number of state-of-the-art tumor clustering approaches on most of the data sets.
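To make the consensus step more concrete, the following is a minimal sketch of how a weighted consensus (co-association) matrix could be built from several clustering solutions. It is an illustration under stated assumptions, not the authors' implementation: the function names `constraint_agreement` and `consensus_matrix` are hypothetical, the confidence factor is approximated here by the fraction of must-link and cannot-link constraints a solution satisfies, and the final partitioning of the matrix (the normalized cut step) is left out.

```python
def constraint_agreement(labels, must_link, cannot_link):
    """Fraction of pairwise constraints that one clustering solution satisfies.

    Assumption: this ratio stands in for the paper's confidence factor,
    whose exact definition is not given in the abstract.
    """
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    count = len(must_link) + len(cannot_link)
    return satisfied / count if count else 1.0


def consensus_matrix(solutions, weights=None):
    """Weighted co-association matrix over all ensemble members.

    solutions: list of label lists, one per clustering solution.
    weights:   optional per-solution weights; equal weights if omitted.
    Entry [i][j] is the weighted share of solutions that place
    samples i and j in the same cluster.
    """
    n = len(solutions[0])
    if weights is None:
        weights = [1.0] * len(solutions)
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(solutions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix


# Toy ensemble: three clustering solutions over four samples.
solutions = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
must_link = [(0, 1)]      # expert says samples 0 and 1 belong together
cannot_link = [(0, 3)]    # expert says samples 0 and 3 must be separated

weights = [constraint_agreement(s, must_link, cannot_link) for s in solutions]
consensus = consensus_matrix(solutions, weights)
```

In this toy example the second solution violates the must-link constraint, so it receives a lower weight and contributes less to the consensus matrix. A graph partitioning method such as normalized cut could then be applied to the resulting matrix to obtain the final clusters.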
