A fast algorithm for learning epistatic genomic relationships.

Genetic epidemiologists strive to determine the genetic profile of diseases. Epistasis is the interaction between two or more genes to affect phenotype. Because this interaction is often non-linear, statistical patterns of epistasis are difficult to detect. Combinatorial methods for detecting epistasis investigate a subset of combinations of genes without employing a search strategy, so they do not scale to the high-dimensional data found in genome-wide association studies (GWAS). We represent genome-phenome interactions using a Bayesian network rule, which is a specialized Bayesian network, and develop an efficient search algorithm that learns from data a high-scoring rule that may contain two or more interacting genes. Our experimental results on synthetic data indicate that this algorithm detects interacting genes as well as a Bayesian network combinatorial method does, and that it is much faster. Our results also indicate that the algorithm can successfully learn genome-phenome relationships from a real GWAS dataset.
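To make the idea of scoring a rule concrete, here is a minimal sketch. It is not the authors' algorithm. It assumes the K2 score of Cooper and Herskovits as the scoring function (the abstract does not name one), uses hypothetical column names, and runs on toy synthetic data in which disease status is the XOR of two binary SNPs. It enumerates every rule with up to two SNPs, which is the exhaustive strategy that does not scale to GWAS data, rather than the efficient search described above.

```python
from collections import Counter
from itertools import combinations, product
from math import lgamma


def k2_log_score(rows, parents, target, n_states=2):
    """Log K2 marginal likelihood of `target` given the `parents` columns."""
    # N_jk: how often each (parent configuration, target state) pair occurs.
    counts = Counter((tuple(r[p] for p in parents), r[target]) for r in rows)
    # N_j: how often each parent configuration occurs.
    totals = Counter()
    for (config, _), n in counts.items():
        totals[config] += n
    score = 0.0
    for config, n_j in totals.items():
        # log[(r - 1)! / (N_j + r - 1)!] with r = n_states
        score += lgamma(n_states) - lgamma(n_j + n_states)
        for state in range(n_states):
            # log(N_jk!)
            score += lgamma(counts[(config, state)] + 1)
    return score


# Toy data (an assumption for illustration): disease = snp_a XOR snp_b,
# and snp_c is irrelevant. Each genotype combination appears 25 times.
rows = [
    {"snp_a": a, "snp_b": b, "snp_c": c, "disease": a ^ b}
    for a, b, c in product((0, 1), repeat=3)
    for _ in range(25)
]

snps = ["snp_a", "snp_b", "snp_c"]
# Score every candidate rule with zero, one, or two SNPs as parents.
scores = {
    parents: k2_log_score(rows, parents, "disease")
    for size in range(3)
    for parents in combinations(snps, size)
}
best = max(scores, key=scores.get)

for parents, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(parents, round(score, 2))
```

On this toy data the pair (snp_a, snp_b) receives the highest score, while each single-SNP rule scores below the rule with no SNPs at all. This is the difficulty with purely epistatic effects: neither gene shows a signal on its own, so a method has to evaluate combinations, and the number of combinations grows quickly with the number of SNPs.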
