Bregmannian consensus clustering for cancer subtypes analysis

Cancer subtype analysis, as an extension of cancer diagnosis, can be regarded as a consensus clustering problem. This analysis helps provide patients with more accurate treatment. Consensus clustering refers to the situation in which several different clusterings have been obtained for a particular data set, and the goal is to aggregate those clustering results into a better clustering solution. In this paper, we propose to generalize the traditional consensus clustering methods in three ways: (1) we provide Bregmannian consensus clustering (BCC), where the loss between the consensus clustering result and all the input clusterings is generalized from the traditional Euclidean distance to a general Bregman loss; (2) we generalize BCC to a weighted case, where each input clustering has its own weight, providing a better solution for the final clustering result; and (3) we propose a novel semi-supervised consensus clustering, which adds must-link and cannot-link constraints to the first two methods. We then obtain three cancer data sets (breast, lung, and colorectal cancer) from The Cancer Genome Atlas (TCGA). Each data set has three data types (mRNA, microRNA, methylation), and each is used in turn to test the clustering accuracy of the proposed algorithms. The experimental results demonstrate that the highest aggregation accuracy of the weighted BCC (WBCC) on the cancer data sets is 90.2%. Moreover, although the lowest accuracy is 62.3%, it is still higher than that of the other methods on the same data set. We therefore conclude that our method is more effective than the competing methods.
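The weighted consensus objective described above can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details the abstract does not give. It assumes that each input clustering is encoded as a binary connectivity matrix, that the weights sum to one, and that the loss is a weighted sum of Bregman divergences from each input matrix to the consensus matrix. Under these assumptions, the weighted mean of the input matrices minimizes the loss for any Bregman divergence, a standard property of Bregman divergences. The clusterings, the weights, and all function names below are hypothetical.

```python
import numpy as np

def connectivity(labels):
    """Connectivity matrix: entry (p, q) is 1 if samples p and q share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def squared_euclidean(x, y):
    """Bregman divergence generated by phi(z) = ||z||^2."""
    return float(np.sum((x - y) ** 2))

def generalized_kl(x, y, eps=1e-12):
    """Generalized I-divergence, generated by phi(z) = sum(z * log z).

    eps is an assumed smoothing constant that keeps the logarithm finite
    when a matrix entry is zero.
    """
    x = x + eps
    y = y + eps
    return float(np.sum(x * np.log(x / y) - x + y))

def weighted_loss(matrices, weights, consensus, divergence):
    """Weighted sum of divergences from each input matrix to the consensus."""
    return sum(w * divergence(m, consensus) for w, m in zip(weights, matrices))

# Three hypothetical input clusterings of six samples (illustrative only).
clusterings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
weights = np.array([0.5, 0.2, 0.3])  # assumed weights, summing to one

matrices = [connectivity(c) for c in clusterings]

# The weighted mean minimizes the weighted Bregman loss over the consensus matrix.
consensus = sum(w * m for w, m in zip(weights, matrices))

for name, div in [("squared Euclidean", squared_euclidean),
                  ("generalized KL", generalized_kl)]:
    print(name, weighted_loss(matrices, weights, consensus, div))
```

In this sketch the consensus matrix is a soft co-association matrix. Turning it into a final partition, or learning the weights instead of fixing them, would require an additional step that is not shown here.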