Learning a forest of Hierarchical Bayesian Networks to model dependencies between genetic markers

We propose a novel probabilistic graphical model dedicated to representing the statistical dependencies between genetic markers in the human genome. Our proposal relies on building a forest of hierarchical latent class models, which accounts for both local and higher-order dependencies between markers. Our motivation is to reduce the dimension of the data that are subsequently submitted to statistical tests of association with diseased/non-diseased status. A generic algorithm, CFHLC, has been designed to learn both the forest structure and the probability distributions. A first implementation of CFHLC has been shown to be tractable on benchmarks describing 100,000 variables for 2,000 individuals on a standard personal computer.
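
To make the shape of such a model concrete, the following is a minimal sketch of a forest of hierarchical latent class trees as a plain data structure. It is not the CFHLC algorithm: the class `Node`, the function `build_toy_forest`, the marker names, and the grouping of markers are all hypothetical and chosen by hand, whereas the actual method learns both the structure and the probability distributions from data. The sketch only illustrates how observed markers sit at the leaves, how latent variables summarise groups of markers or of other latent variables, and how the roots give a lower-dimensional view of the data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A variable in a hierarchical latent class tree (hypothetical helper)."""
    name: str
    latent: bool  # True for latent variables, False for observed markers
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List["Node"]:
        """Return the childless nodes below this node (or the node itself)."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def build_toy_forest() -> List[Node]:
    # Assumption: six markers grouped by hand, purely for illustration.
    snps = [Node(f"snp{i}", latent=False) for i in range(1, 7)]
    h1 = Node("H1", latent=True, children=snps[0:2])  # local dependency
    h2 = Node("H2", latent=True, children=snps[2:4])  # local dependency
    h3 = Node("H3", latent=True, children=[h1, h2])   # higher-order dependency
    h4 = Node("H4", latent=True, children=snps[4:6])  # a second, separate tree
    return [h3, h4]  # the forest is the list of tree roots


forest = build_toy_forest()
n_markers = sum(len(root.leaves()) for root in forest)
n_roots = len(forest)
print(n_markers, n_roots)
```

In this toy forest, six observed markers are summarised by two root latent variables, which is the kind of dimension reduction the abstract has in mind before association testing.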
