Clustering cancer gene expression data by projective clustering ensemble

Gene expression data analysis has important implications for gene treatments, cancer diagnosis and other domains. Clustering is an important and promising tool for analyzing gene expression data. Such data typically contain a large number of genes but only a limited number of samples, so various projective clustering techniques and ensemble techniques have been proposed to address these challenges. However, it is difficult to combine these two kinds of techniques in a way that avoids the curse of dimensionality and improves the performance of gene expression data clustering. In this paper, we employ a projective clustering ensemble (PCE) to integrate the advantages of projective clustering and ensemble clustering, and to avoid the dilemma of combining multiple projective clusterings. Our experimental results on publicly available cancer gene expression data show that PCE improves the quality of gene expression data clustering by at least 4.5% on average over other related techniques, including dimensionality-reduction-based single clustering and ensemble approaches. The empirical study demonstrates that combining projective clustering with ensemble clustering is a necessary and promising way to further improve the clustering of cancer gene expression data. PCE can serve as an effective alternative technique for clustering gene expression data.
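The general idea of clustering in several feature subspaces and then merging the results can be illustrated with a minimal sketch. This is not the PCE algorithm evaluated in the paper: random feature subsets stand in for projective clustering, and a thresholded co-association matrix stands in for the consensus step. The function names (`kmeans`, `coassociation`, `consensus`), the toy matrix `data`, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of feature vectors; returns labels."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every sample to its nearest center (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def coassociation(data, k, n_runs, n_feats, seed=0):
    """Fraction of runs in which each pair of samples shares a cluster.

    Each run clusters the samples on a random subset of features
    (a crude stand-in for a projective clustering; an assumption here).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        feats = rng.sample(range(d), n_feats)
        projected = [[row[f] for f in feats] for row in data]
        labels = kmeans(projected, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


def consensus(co, threshold=0.5):
    """Connected components of the graph linking pairs with co > threshold."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy expression matrix (hypothetical): rows are samples, columns are genes.
# Columns 0, 1 and 3 separate the two groups; columns 2 and 4 are noise.
data = [
    [0.1, 0.2, 5.0, 0.3, 9.0],
    [0.2, 0.1, 1.0, 0.2, 2.0],
    [0.0, 0.3, 7.0, 0.1, 4.0],
    [5.1, 5.2, 3.0, 5.3, 8.0],
    [5.2, 5.0, 6.0, 5.1, 1.0],
    [5.0, 5.3, 2.0, 5.2, 6.0],
]
co = coassociation(data, k=2, n_runs=8, n_feats=3, seed=0)
labels = consensus(co, threshold=0.5)
print(labels)
```

Averaging over many subspaces means that a single run dominated by noisy genes has limited influence on the final partition. The threshold of 0.5 is an arbitrary choice for this sketch, and the number of consensus clusters it yields is not fixed in advance.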
