Genetic studies of complex human diseases: Characterizing SNP-disease associations using Bayesian networks

Background: Detecting epistatic interactions plays a significant role in improving our understanding of the pathogenesis of complex human diseases and in improving their prevention, diagnosis, and treatment. Applying machine learning or statistical methods to epistatic interaction detection runs into several common problems: a very limited number of samples, an extremely large search space, a large number of false positives, and the difficulty of measuring the association between disease markers and the phenotype.

Results: To address these problems, we propose a score-based Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on the simulated data show that our method outperforms some other commonly used methods in terms of power and sample efficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data.

Conclusions: We propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast Branch-and-Bound algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applying the proposed method to real GWAS (genome-wide association studies) datasets may provide helpful insights into the genetic basis of age-related macular degeneration, late-onset Alzheimer's disease, and autism.
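The idea of scoring a two-layer network, in which a set of SNPs are the parents of a single disease node, can be illustrated with a small sketch. The code below is not the EpiBN scoring function, and it does not use Branch-and-Bound or MCMC: it assumes a standard BIC penalty with three genotype values per SNP, and it searches parent sets exhaustively. The names `bic_score` and `best_parent_set` and the toy data are hypothetical, chosen only for illustration.

```python
import itertools
import math
from collections import Counter


def bic_score(genotypes, phenotype, parents):
    """BIC score of the network in which `parents` point to the disease node.

    genotypes: list of rows, each a tuple of SNP values
    phenotype: list of 0/1 case-control labels
    parents:   tuple of SNP column indices (may be empty)
    """
    n = len(phenotype)
    joint = Counter()   # counts of (parent configuration, label)
    config = Counter()  # counts of parent configuration
    for row, y in zip(genotypes, phenotype):
        key = tuple(row[j] for j in parents)
        joint[(key, y)] += 1
        config[key] += 1
    # Maximum-likelihood log-likelihood of the labels given the parents.
    loglik = sum(c * math.log(c / config[key]) for (key, _), c in joint.items())
    # Assumption: 3 genotype values per SNP and a binary phenotype, so one
    # free parameter per possible parent configuration.
    n_params = 3 ** len(parents)
    return loglik - 0.5 * math.log(n) * n_params


def best_parent_set(genotypes, phenotype, max_size=2):
    """Exhaustively score every SNP subset up to max_size; return the best."""
    n_snps = len(genotypes[0])
    candidates = [
        subset
        for k in range(max_size + 1)
        for subset in itertools.combinations(range(n_snps), k)
    ]
    return max(candidates, key=lambda s: bic_score(genotypes, phenotype, s))


# Toy data (fabricated for illustration): SNP 0 and SNP 1 determine the
# phenotype through an XOR rule, so neither has a marginal effect; SNP 2 is noise.
genotypes = [g for g in itertools.product((0, 1), repeat=3) for _ in range(10)]
phenotype = [a ^ b for a, b, _ in genotypes]

print(best_parent_set(genotypes, phenotype, max_size=3))
```

On this toy data, each SNP scores worse on its own than the empty parent set, yet the pair (SNP 0, SNP 1) gets the highest score overall, which is the "no marginal effects" situation described in the abstract. Exhaustive search is only feasible for a handful of SNPs; the abstract's Branch-and-Bound search and MCMC screening address that scaling problem.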
