Consensus clustering applied to multi-omics disease subtyping

Background: Given the diversity of omics data and the difficulty of choosing one result among those produced by several methods, consensus strategies have the potential to reconcile multiple inputs and produce robust results.

Results: Here, we introduce ClustOmics, a generic consensus clustering tool that we use in the context of cancer subtyping. ClustOmics relies on a non-relational graph database, which allows for the simultaneous integration of both multiple omics data and results from various clustering methods. This new tool reconciles input clusterings, regardless of their origin, their number, their size or their shape. ClustOmics implements an intuitive and flexible strategy based on the idea of evidence accumulation clustering. It computes the co-occurrences of pairs of samples in input clusters and uses this score as a similarity measure to reorganize data into consensus clusters.

Conclusion: We applied ClustOmics to multi-omics disease subtyping on real TCGA cancer data from ten different cancer types. We showed that ClustOmics is robust to heterogeneous qualities of input partitions, smoothing and reconciling preliminary predictions into high-quality consensus clusters, from both a computational and a biological point of view. A comparison with a state-of-the-art consensus-based integration tool, COCA, further corroborated this finding. However, the main interest of ClustOmics is not to compete with other tools, but rather to benefit from their various predictions when no gold-standard metric is available to assess their significance.

Availability: The ClustOmics source code, released under the MIT license, and the results obtained on TCGA cancer data are available on GitHub.
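The co-occurrence scoring described in the Results can be illustrated with a minimal sketch. This is not the ClustOmics implementation: the function names, the `min_support` threshold, and the use of connected components to form consensus clusters are all assumptions made for illustration, standing in for whatever graph-based procedure the tool actually applies. Each input clustering is represented as a dictionary mapping a sample to a cluster label, so samples missing from some inputs are tolerated.

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence_counts(clusterings):
    """Count, for each pair of samples, the number of input clusterings
    that place both samples in the same cluster."""
    counts = defaultdict(int)
    for labels in clusterings:  # labels: dict mapping sample -> cluster label
        by_cluster = defaultdict(list)
        for sample, cluster in labels.items():
            by_cluster[cluster].append(sample)
        for members in by_cluster.values():
            for a, b in combinations(sorted(members), 2):
                counts[(a, b)] += 1
    return dict(counts)


def consensus_clusters(clusterings, min_support):
    """Hypothetical consensus step: link pairs that co-occur in at least
    `min_support` clusterings, then return the connected components."""
    counts = co_occurrence_counts(clusterings)
    samples = sorted({s for labels in clusterings for s in labels})
    parent = {s: s for s in samples}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in counts.items():
        if n >= min_support:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in samples:
        groups[find(s)].append(s)
    return sorted(sorted(g) for g in groups.values())


# Toy input: three clusterings of four samples (invented data).
toy = [
    {"s1": 0, "s2": 0, "s3": 1, "s4": 1},
    {"s1": "a", "s2": "a", "s3": "a", "s4": "b"},
    {"s1": 1, "s2": 1, "s3": 2, "s4": 2},
]
print(co_occurrence_counts(toy))
print(consensus_clusters(toy, min_support=2))
```

In this toy input, the pair (s1, s2) is grouped together by all three clusterings and (s3, s4) by two of them, so a threshold of 2 separates the samples into two consensus clusters, while the weaker links introduced by the second clustering are discarded.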
