Cancer subtype identification by consensus guided graph autoencoders

MOTIVATION: Cancer subtype identification aims to divide cancer patients into subgroups with distinct clinical phenotypes and to facilitate the development of subgroup-specific therapies. The massive amount of multi-omics data accumulated in public databases has provided unprecedented opportunities to fulfill this task. As a result, considerable computational effort has been devoted to accurately identifying cancer subtypes via integrative analysis of these multi-omics datasets.

RESULTS: In this paper, we propose a Consensus Guided Graph Autoencoder (CGGA) to identify cancer subtypes. First, we learn a new feature matrix for each omic by using graph autoencoders, which incorporate both structure information and node features during learning. Second, we learn a set of omic-specific similarity matrices together with a consensus matrix based on the features obtained in the first step. The learned omic-specific similarity matrices are then fed back to the graph autoencoders to guide the feature learning. By iterating these two steps, our method obtains a final consensus similarity matrix for cancer subtyping. To evaluate the prediction performance of our method, we compare CGGA with several approaches ranging from general-purpose multi-view clustering algorithms to multi-omics-specific integrative methods. The experimental results on both generic datasets and cancer datasets show that CGGA outperforms the compared methods. Moreover, we validate the effectiveness of our method in leveraging multi-omics datasets to identify cancer subtypes. In addition, we investigate the clinical implications of the obtained clusters for glioblastoma and provide new insights into the treatment of patients with different subtypes.

AVAILABILITY: The source code of our method is freely available at https://github.com/alcs417/CGGA.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
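The two alternating steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository above for that) and rests on several assumptions. The trained graph autoencoder is replaced by a single normalised graph propagation step followed by PCA. The omic-specific similarity matrices are plain Gaussian k-nearest-neighbour graphs. The consensus is a simple average that does not influence the per-omic graphs, whereas the paper learns them together. All function names and parameters (`knn_affinity`, `encode`, `consensus_similarity`, `k`, `dim`, `n_iter`) are hypothetical.

```python
import numpy as np


def knn_affinity(features, k=5):
    """Symmetric k-nearest-neighbour affinity with a Gaussian kernel."""
    sq = np.sum(features ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0)
    np.fill_diagonal(dist2, np.inf)              # exclude self-matches
    idx = np.argsort(dist2, axis=1)[:, :k]       # k nearest neighbours per sample
    rows = np.repeat(np.arange(len(features)), k)
    cols = idx.ravel()
    sigma2 = np.mean(dist2[rows, cols]) + 1e-12  # data-driven kernel bandwidth
    affinity = np.zeros_like(dist2)
    affinity[rows, cols] = np.exp(-dist2[rows, cols] / sigma2)
    return np.maximum(affinity, affinity.T)      # symmetrise


def encode(features, affinity, dim=10):
    """Assumed stand-in for a graph autoencoder: one propagation step, then PCA."""
    a_hat = affinity + np.eye(len(affinity))     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    smoothed = a_norm @ features                 # mix node features along the graph
    smoothed = smoothed - smoothed.mean(axis=0)
    u, s, _ = np.linalg.svd(smoothed, full_matrices=False)
    dim = min(dim, len(s))
    return u[:, :dim] * s[:dim]                  # low-dimensional embedding


def consensus_similarity(views, k=5, dim=10, n_iter=3):
    """Alternate feature learning and similarity learning for n_iter >= 1 rounds."""
    graphs = [knn_affinity(x, k) for x in views]
    for _ in range(n_iter):
        # Step 1: learn features for each omic from its data and current graph.
        embeddings = [encode(x, g, dim) for x, g in zip(views, graphs)]
        # Step 2: rebuild omic-specific similarities and average them.
        graphs = [knn_affinity(z, k) for z in embeddings]
        consensus = np.mean(graphs, axis=0)
    return consensus, graphs


# Synthetic example: two "omics" measured on 40 samples from two groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
views = [rng.normal(size=(40, d)) + 4.0 * labels[:, None] for d in (30, 50)]
consensus, graphs = consensus_similarity(views)
same = labels[:, None] == labels[None, :]
print(consensus[same].mean(), consensus[~same].mean())
```

In a real analysis the final consensus matrix would be passed to a clustering step, such as spectral clustering, to assign patients to subtypes; the sketch stops at the matrix itself.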
