Modeling associations between genetic markers using Bayesian networks

Motivation: Understanding the patterns of association between polymorphisms at different loci in a population (linkage disequilibrium, LD) is of fundamental importance in many genetic studies. Many coefficients have been proposed for measuring the degree of LD, but they provide only a static view of the current LD structure. Generative models (GMs) were proposed to go beyond these measures, giving not only a description of the actual LD structure but also a tool to help understand the process that generated it. GMs based on coalescent theory have been the most appealing because they link LD to evolutionary factors. Nevertheless, inference and parameter estimation for such models remain computationally challenging.

Results: We present a more practical method for building GMs that describe LD. The method is based on learning weighted Bayesian network structures from haplotype data, extracting equivalence classes of structures and using them to model LD. Results obtained on public data from the HapMap database show that the method is a promising tool for modeling LD. The associations represented by the learned models are correlated with D′, a traditional measure of LD. The method was able to represent the LD blocks found by standard tools. Both the granularity of the association blocks and the readability of the models can be controlled. The results suggest that the causal information captured by our method can be informative about the conservation of the genetic markers and can guide the selection of a subset of representative markers.

Availability: The implementation of the method is available upon request by email.

Contact: maciel@sc.usp.br
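
The abstract compares the learned associations with the pairwise LD measure D′. For readers unfamiliar with it, the following is a minimal sketch of Lewontin's normalized D′ for two biallelic loci, computed from phased haplotypes. It is not the authors' implementation: the function name `d_prime`, the 0/1 allele coding and the convention of returning 0.0 when a locus is monomorphic are assumptions made only for illustration.

```python
def d_prime(haplotypes, i, j):
    """Lewontin's D' between biallelic loci i and j.

    haplotypes: sequence of haplotypes, each a sequence of alleles
    coded 0/1 (assumed coding). Returns a value in [-1, 1].
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    # Frequencies of allele 1 at each locus and of the 1-1 haplotype.
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n

    # Raw disequilibrium coefficient D.
    d = p_ab - p_a * p_b

    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    if d_max == 0:
        # A locus is monomorphic, so D' is undefined; 0.0 is an assumed convention.
        return 0.0
    return d / d_max
```

For example, `d_prime([(0, 0), (0, 0), (1, 1), (1, 1)], 0, 1)` describes two loci in complete association, while `[(0, 0), (0, 1), (1, 0), (1, 1)]` describes two independent loci. Because D′ is strictly pairwise, it gives the static view of LD that the generative models discussed above aim to go beyond.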
