Integrated Analysis of CNV, Gene Expression and Disease State Data in Prostate Cancer

Background: Copy number variation (CNV) may contribute to the development of complex diseases. However, because the mechanisms underlying path associations are complex and sample sizes are limited, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene expression, and disease label data provides an opportunity to design a new machine learning framework for predicting potential disease-related CNVs.

Results: We developed a novel machine learning approach, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Midcorrelation and L1-regularized Logistic Regression under stability selection), to predict CNV-disease path associations from a data set containing CNV, gene expression, and disease state label data. CNVs, genes, and diseases are connected by edges and together constitute a biological association network. To construct this network, we first used a self-adaptive biweight midcorrelation (BM) formula to calculate correlation coefficients between CNVs and genes. We then used logistic regression with an L1 penalty (LLR) to detect disease-related genes. Both steps were run under a stability selection strategy, which can effectively reduce false positives. Finally, a weighted path search algorithm was applied to find the top D path associations and the important CNVs.

Conclusions: Compared with state-of-the-art methods, IHI-BMLLR discovers CNV-disease path associations by integrating CNV, gene expression, and disease label data and by combining stability selection with a weighted path search algorithm, thereby mining more information from the data sets and improving the accuracy of the CNVs obtained. Experimental results on both simulated and prostate cancer data show that IHI-BMLLR performs significantly better than two state-of-the-art CNV detection methods (CCRET and DPtest) under false positive control. Furthermore, applying IHI-BMLLR to the prostate cancer data revealed significant path associations. Three new cancer-related genes were discovered in the paths; these genes need to be verified by future biological research.
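
To make the first step of the pipeline concrete, the following is a minimal sketch of the standard biweight midcorrelation between one CNV profile and one gene expression profile. It is not the self-adaptive variant used in IHI-BMLLR, whose details the abstract does not give; the function names `bicor` and `_biweight_scaled` and the toy data are illustrative assumptions. The point of the measure is that samples far from the median receive zero weight, so a single outlier has little influence on the coefficient.

```python
from math import sqrt
from statistics import median


def _biweight_scaled(values):
    """Return median-centred, biweight-weighted values scaled to unit norm.

    Assumption: this is the textbook biweight midcorrelation weighting
    (tuning constant 9), not the paper's self-adaptive formula.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    weighted = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Tukey biweight: points with |u| >= 1 are treated as outliers (weight 0).
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted.append((v - med) * w)
    norm = sqrt(sum(c * c for c in weighted))
    if norm == 0:
        raise ValueError("all observations received zero weight")
    return [c / norm for c in weighted]


def bicor(x, y):
    """Biweight midcorrelation of two equally long numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    xs = _biweight_scaled(x)
    ys = _biweight_scaled(y)
    return sum(a * b for a, b in zip(xs, ys))


# Toy example (hypothetical values): a CNV signal and a gene expression
# profile that track each other except for one extreme sample.
cnv = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
expr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
r_robust = bicor(cnv, expr)
```

In a full pipeline of the kind described above, a coefficient like this would be computed for every CNV-gene pair and, under stability selection, recomputed on repeated subsamples so that only edges that are consistently strong are kept; that outer loop is omitted here.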
