New Analysis Framework Incorporating Mixed Mutual Information and Scalable Bayesian Networks for Multimodal High Dimensional Genomic and Epigenomic Cancer Data

We propose a novel two-stage analysis strategy to discover candidate genes associated with particular cancer outcomes in large multimodal cancer genomics databases, such as The Cancer Genome Atlas (TCGA). In the first stage, we use mixed mutual information to perform variable selection; in the second stage, we use scalable Bayesian network (BN) modeling to identify candidate genes and their interactions. Two crucial features of the proposed approach are (i) the ability to handle mixed data types (continuous and discrete, genomic, epigenomic, etc.) and (ii) a flexible boundary between the variable selection and network modeling stages, which can be adjusted to match the scalability of the investigators’ BN software and hardware. These two features make the proposed analytical framework highly generalizable. We apply this strategy to three different TCGA datasets (LGG, Brain Lower Grade Glioma; HNSC, Head and Neck Squamous Cell Carcinoma; STES, Stomach and Esophageal Carcinoma), linking multimodal molecular information (SNPs, mRNA expression, DNA methylation) to two clinical outcome variables (tumor status and patient survival). We identify 11 candidate genes, of which 6 have already been directly implicated in the cancer literature. One novel LGG prognostic factor suggested by our analysis, methylation of the type II transmembrane serine protease TMPRSS11F, presents an intriguing direction for follow-up studies.
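The first stage can be illustrated with a minimal sketch. It is not the estimator used in this work: as a simplifying assumption, continuous variables are discretized into equal-frequency bins and a plug-in mutual information estimate is computed, rather than a dedicated mixed discrete-continuous estimator. All function names, the toy variables, and the cutoff `k` are hypothetical; `k` stands in for the adjustable boundary between variable selection and BN modeling.

```python
import math
import random
from collections import Counter


def quantile_bins(values, n_bins=4):
    """Discretize a continuous variable into roughly equal-frequency bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(values), n_bins - 1)
    return bins


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def select_top_k(features, outcome, k, continuous=()):
    """Score each feature by MI with the outcome and keep the k best names.

    `k` plays the role of the stage boundary: a larger k passes more
    variables on to the (not shown) Bayesian network stage.
    """
    scores = {}
    for name, values in features.items():
        coded = quantile_bins(values) if name in continuous else values
        scores[name] = mutual_information(coded, outcome)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (entirely synthetic, for illustration only).
rng = random.Random(0)
outcome = [rng.randint(0, 1) for _ in range(200)]
features = {
    # Continuous, expression-like variable that tracks the outcome.
    "expr_informative": [o + rng.gauss(0, 0.3) for o in outcome],
    # Discrete, SNP-like variable unrelated to the outcome.
    "snp_noise": [rng.randint(0, 2) for _ in outcome],
    # Continuous, methylation-like variable unrelated to the outcome.
    "meth_noise": [rng.random() for _ in outcome],
}
selected, scores = select_top_k(
    features, outcome, k=1, continuous={"expr_informative", "meth_noise"}
)
print(selected)
```

Raising `k` hands more variables to the network stage, which is reasonable when the BN software and hardware can accommodate a larger model; lowering it shifts more of the filtering burden onto the mutual information stage.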
