Identification of genetic interaction networks via an evolutionary algorithm evolved Bayesian network

Background: Medicine is moving toward precision medicine, which aims to prevent and treat disease by taking inter-individual variability into account. A large part of that variability lies in our genetic makeup. With the fast-paced improvement of high-throughput genome sequencing methods, a tremendous amount of genetic data has already been generated. The next hurdle for precision medicine is having sufficient computational tools to analyze these large datasets. Genome-wide association studies (GWAS) have been the primary method for assessing the relationship between single nucleotide polymorphisms (SNPs) and disease traits. While GWAS are effective at finding individual SNPs with strong main effects, they do not capture potential interactions among multiple SNPs. For many traits, a large proportion of the variation remains unexplained by main effects alone, leaving the door open to exploring the role of genetic interactions. However, identifying genetic interactions in large-scale genomic data poses a challenge even for modern computing.

Results: We present a new algorithm, Grammatical Evolution Bayesian Network (GEBN), that uses Bayesian networks to identify interactions in the data while using an evolutionary algorithm to reduce the computational cost of network optimization. GEBN excelled in simulation studies in which the data contained both main effects and interaction effects. We also applied GEBN to a Type 2 diabetes (T2D) dataset obtained from the Marshfield Personalized Medicine Research Project (PMRP). We identified genetic interactions for T2D cases and controls and used information from those interactions to classify T2D samples, obtaining an average testing area under the curve (AUC) of 86.8%. We also identified several interacting genes, such as INADL and LPP, that are known to be associated with T2D.

Conclusions: Developing computational tools to explore genetic associations beyond main effects remains a critically important challenge in human genetics. Methods such as GEBN demonstrate the utility of considering genetic interactions, as they likely explain some of the missing heritability.
