DeepMOCCA: A pan-cancer prognostic model identifies personalized prognostic markers through graph attention and multi-omics data integration

Combining multiple types of genomic, transcriptional, proteomic, and epigenetic datasets has the potential to reveal biological mechanisms across multiple scales, and may lead to more accurate models for clinical decision support. Developing efficient models that can derive clinical outcomes from high-dimensional data remains difficult; challenges include integrating multiple types of omics data, including biological background knowledge, and developing machine learning models that can cope with this high dimensionality when only a few samples are available for training. We developed DeepMOCCA, a framework for multi-omics cancer analysis. We combine different types of omics data using the biological relations between genes, transcripts, and proteins, link the multi-omics data to background knowledge in the form of protein–protein interaction networks, and use graph convolutional neural networks to exploit this combination of multi-omics data and background knowledge. DeepMOCCA predicts survival time for individual patient samples across 33 cancer types and outperforms most existing survival prediction methods. Moreover, DeepMOCCA includes a graph attention mechanism that prioritizes driver genes and prognostic markers in a patient-specific manner; the attention mechanism can be used to identify drivers and prognostic markers both within cohorts and in individual patients.

Author summary

Linking the features of tumors to a prognosis for the patient is a critical part of managing cancer. Many methods have been applied to this problem, but we still lack accurate prognostic markers for many cancers. We now have more information than ever before on the state of the cancer genome, the epigenetic changes in tumors, and gene expression at both the RNA and protein levels. Here, we address the question of how these data can be used to predict cancer survival and to discover which tumor genes contribute most to the prognosis in individual tumor samples. We have developed a computational model, DeepMOCCA, that uses artificial neural networks underpinned by a large graph constructed from background knowledge about the functional interactions between genes and their products. We show that DeepMOCCA can predict cancer survival time based entirely on features of the tumor at the cellular and molecular level. The method confirms many genes already known to affect survival, and for some cancers it suggests new genes that were either not previously implicated in survival or not known to be important in that particular cancer. The ability of our method to identify the important features in individual tumors raises the possibility of personalized therapy based on the gene or network that dominates the prognosis for that patient.
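The general idea described above (per-gene omics features propagated over an interaction graph, followed by an attention step that both pools the graph into a patient-level score and ranks genes for that patient) can be illustrated with a small sketch. This is not the DeepMOCCA implementation: the four-gene graph, the feature values, the weight matrix, and the `query` and `readout` vectors are all made-up placeholders, and the layer is a plain mean-aggregation convolution chosen for brevity. In a real model the weights would be learned from survival data.

```python
import math

def neighbors_with_self_loops(edges, n_nodes):
    """Adjacency lists for an undirected graph; each node is also its own neighbor."""
    nbrs = [{i} for i in range(n_nodes)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return [sorted(s) for s in nbrs]

def graph_conv(features, nbrs, weights):
    """One mean-aggregation graph convolution layer followed by ReLU.

    features: one feature vector (length d_in) per node
    weights:  d_in x d_out matrix as nested lists
    """
    d_in, d_out = len(weights), len(weights[0])
    hidden = []
    for node_nbrs in nbrs:
        # Average each input feature over the node's neighborhood.
        mean = [sum(features[j][k] for j in node_nbrs) / len(node_nbrs)
                for k in range(d_in)]
        # Linear map, then ReLU.
        hidden.append([max(0.0, sum(mean[k] * weights[k][o] for k in range(d_in)))
                       for o in range(d_out)])
    return hidden

def attention_scores(hidden, query):
    """Softmax over per-node scores; larger values mark more influential nodes."""
    logits = [sum(h * q for h, q in zip(row, query)) for row in hidden]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, edges, weights, query, readout):
    """Return (risk score, per-gene attention) for one patient sample."""
    nbrs = neighbors_with_self_loops(edges, len(features))
    hidden = graph_conv(features, nbrs, weights)
    alpha = attention_scores(hidden, query)
    # Attention-weighted sum of node embeddings gives a graph-level embedding.
    pooled = [sum(a * row[o] for a, row in zip(alpha, hidden))
              for o in range(len(hidden[0]))]
    risk = sum(p * r for p, r in zip(pooled, readout))
    return risk, alpha

# Hypothetical toy input: 4 genes, 3 features each
# (mutation flag, expression z-score, methylation level).
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
features = [[1.0, 2.1, 0.2], [0.0, -0.4, 0.7], [0.0, 0.3, 0.5], [1.0, 1.5, 0.1]]
edges = [(0, 1), (1, 2), (2, 3)]                    # placeholder interaction edges
weights = [[0.5, -0.2], [0.3, 0.4], [-0.1, 0.6]]    # placeholder, not trained
query = [1.0, 0.5]                                  # placeholder attention vector
readout = [0.8, -0.3]                               # placeholder output weights

risk, alpha = predict(features, edges, weights, query, readout)
ranking = sorted(zip(genes, alpha), key=lambda pair: pair[1], reverse=True)
print("risk score:", risk)
print("genes ranked by attention:", ranking)
```

Because the attention weights depend on the sample's own features, two patients with the same interaction graph can yield different gene rankings, which is the sense in which such a mechanism is patient-specific.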
