Mining genetic epidemiology data with Bayesian networks I: Bayesian networks and example application (plasma apoE levels)

MOTIVATION: The wealth of single nucleotide polymorphism (SNP) data within candidate genes and anticipated across the genome poses enormous analytical problems for studies of genotype-to-phenotype relationships, and modern data mining methods may be particularly well suited to meet the swelling challenges. In this paper, we introduce the method of Belief (Bayesian) networks to the domain of genotype-to-phenotype analyses and provide an example application.

RESULTS: A Belief network is a probabilistic graphical model that represents a joint multivariate probability distribution and reflects conditional independences between variables. Given the data, optimal network topology can be estimated with the assistance of heuristic search algorithms and scoring criteria. Statistical significance of edge strengths can be evaluated using Bayesian methods and bootstrapping. As an example application, the method of Belief networks was applied to 20 SNPs in the apolipoprotein (apo) E gene and plasma apoE levels in a sample of 702 individuals from Jackson, MS. Plasma apoE level was the primary target variable. These analyses indicate that the edge between SNP 4075, coding for the well-known epsilon2 allele, and plasma apoE level was strong. Belief networks can effectively describe complex uncertain processes and can both learn from data and incorporate prior knowledge.

AVAILABILITY: Various alternative and supplemental networks (not given in the text), as well as source code extensions, are available from the authors.

SUPPLEMENTARY INFORMATION: http://bioinformatics.oxfordjournals.org.
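The abstract's workflow (score candidate structures, search heuristically, then bootstrap to gauge edge strength) can be illustrated with a minimal sketch. This is not the authors' software or their scoring criterion. It assumes a BIC score, restricts the search to greedy selection of parents for a single target variable, and runs on synthetic data. The names `bic_score`, `greedy_parents`, `bootstrap_edge_strength`, `make_toy_data`, `snp_a`, `snp_b`, and `level` are all hypothetical.

```python
import math
import random
from collections import Counter


def bic_score(rows, child, parents):
    """BIC of a discrete child given discrete parents (multinomial MLE)."""
    n = len(rows)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in rows)
    loglik = sum(c * math.log(c / parent_counts[cfg])
                 for (cfg, _), c in joint.items())
    child_states = len({r[child] for r in rows})
    # Free parameters: (states - 1) per observed parent configuration.
    n_params = (child_states - 1) * len(parent_counts)
    return loglik - 0.5 * n_params * math.log(n)


def greedy_parents(rows, child, candidates):
    """Heuristic search: add the best-scoring parent until BIC stops improving."""
    parents, best = [], bic_score(rows, child, [])
    while True:
        scored = [(bic_score(rows, child, parents + [c]), c)
                  for c in candidates if c not in parents]
        if not scored:
            return parents
        top, cand = max(scored)
        if top <= best:
            return parents
        parents.append(cand)
        best = top


def bootstrap_edge_strength(rows, child, candidates, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each candidate -> child edge is learned."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        hits.update(greedy_parents(sample, child, candidates))
    return {c: hits[c] / n_boot for c in candidates}


def make_toy_data(n=500, seed=1):
    """Synthetic data (an assumption): snp_a shifts the level, snp_b is noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        snp_a = rng.choice([0, 1, 2])
        snp_b = rng.choice([0, 1, 2])
        level = 1 if rng.random() < 0.2 + 0.3 * snp_a else 0
        rows.append({"snp_a": snp_a, "snp_b": snp_b, "level": level})
    return rows


if __name__ == "__main__":
    data = make_toy_data()
    print(bootstrap_edge_strength(data, "level", ["snp_a", "snp_b"]))
```

In this toy setting, an edge that is recovered in most resamples plays the role of a "strong" edge. A full structure search over all 21 variables would also have to score edges among the SNPs and guard against cycles, which this sketch omits.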

[1]  E. Davidson,et al.  The hardwiring of development: organization and function of genomic regulatory systems. , 1997, Development.

[2]  David Page,et al.  Modelling regulatory pathways in E. coli from time series expression profiles , 2002, ISMB.

[3]  G. Rubin,et al.  The Role of the Genome Project in Determining Gene Function: Insights from Model Organisms , 1996, Cell.

[4]  R. Mahley,et al.  Abnormal lipoprotein receptor-binding activity of the human E apoprotein due to cysteine-arginine interchange at a single site. , 1982, The Journal of biological chemistry.

[5]  Christopher Meek,et al.  Learning Bayesian Networks with Discrete Variables from Data , 1995, KDD.

[6]  M A Province Sequential methods of analysis for genome scans. , 2001, Advances in genetics.

[7]  J. Pearl Causality: Models, Reasoning and Inference , 2000 .

[8]  E. Boerwinkle,et al.  Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. , 2000, Genome research.

[9]  Paul J. Krause,et al.  Learning probabilistic networks , 1999, The Knowledge Engineering Review.

[10]  J. Darroch,et al.  A Characterization of the Dirichlet Distribution , 1971 .

[11]  Henry Tirri,et al.  B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis , 2002, Int. J. Artif. Intell. Tools.

[12]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[13]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[14]  E. Boerwinkle,et al.  Apolipoprotein E polymorphism influences postprandial retinyl palmitate but not triglyceride concentrations. , 1994, American journal of human genetics.

[15]  D. Madigan,et al.  Correction to: ``Bayesian model averaging: a tutorial'' [Statist. Sci. 14 (1999), no. 4, 382--417; MR 2001a:62033] , 2000 .

[16]  David Maxwell Chickering,et al.  Learning Bayesian Networks is NP-Complete , 2016, AISTATS.

[17]  Michael A. Province,et al.  30 Sequential methods of analysis for genome scans , 2001 .

[18]  David Maxwell Chickering,et al.  Learning Equivalence Classes of Bayesian Network Structures , 1996, UAI.

[19]  Adrian E. Raftery,et al.  Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors , 1999 .

[20]  J. Ott,et al.  A train of thoughts on gene mapping. , 2001, Theoretical population biology.

[21]  Momiao Xiong,et al.  Generalized T2 test for genome association studies. , 2002, American journal of human genetics.

[22]  E. Boerwinkle,et al.  Simultaneous effects of the apolipoprotein E polymorphism on apolipoprotein E, apolipoprotein B, and cholesterol metabolism. , 1988, American journal of human genetics.

[23]  Gregory F. Cooper,et al.  A Bayesian method for the induction of probabilistic networks from data , 1992, Machine Learning.

[24]  Ross D. King,et al.  Application of metabolomics to plant genotype discrimination using statistics and machine learning , 2002, ECCB.

[25]  A. Zharkikh,et al.  Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. , 1995, Molecular phylogenetics and evolution.

[26]  J. Gilbert,et al.  SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. , 2000, American journal of human genetics.

[27]  Doug Fisher,et al.  Learning from Data: Artificial Intelligence and Statistics V , 1996 .

[28]  David Heckerman,et al.  A Characterization of the Dirichlet Distribution Through Global and Local Independence , 1994, UAI 1994.

[29]  H. Akaike,et al.  Information Theory and an Extension of the Maximum Likelihood Principle , 1973 .

[30]  Michal Linial,et al.  Using Bayesian Networks to Analyze Expression Data , 2000, J. Comput. Biol..

[31]  R. Mahley,et al.  Type III hyperlipoproteinemia associated with apolipoprotein E phenotype E3/3. Structure and genetics of an apolipoprotein E3 variant. , 1989, The Journal of clinical investigation.

[32]  Helen Piontkivska,et al.  Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used. , 2004, Molecular phylogenetics and evolution.

[33]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[34]  M. Rosseneu,et al.  The significance of apolipoprotein E structure to the metabolism of plasma triglyceride-rich lipoproteins. , 1994, Biological chemistry Hoppe-Seyler.

[35]  Tommi S. Jaakkola,et al.  Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks , 2000, Pacific Symposium on Biocomputing.

[36]  C. Sing,et al.  A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. , 2001, Genome research.