A Markov blanket-based method for detecting causal SNPs in GWAS

BackgroundDetecting epistatic interactions associated with complex and common diseases can help to improve prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational method for identifying epistatic interactions associated with common diseases becomes a great challenge to bioinformatics society, because the study of epistatic interactions often deals with the large size of the genotyped data and the huge amount of combinations of all the possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and introduce a lot of false positives. In addition, most methods are not suitable for genome-wide scale studies due to their computational complexity.ResultsWe propose a new Markov Blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket) to detect epistatic interactions in case-control GWAS. Markov blanket of a target variable T can completely shield T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy by calculating the association between variables to avoid the time-consuming training process as in other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly-used methods and show that our method significantly outperforms other methods and is capable of finding SNPs strongly associated with diseases.ConclusionsOur study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains less false positives compared to other existing methods. Given the huge size of genomic dataset produced by GWAS, this is critical in saving the potential costs of biological experiments and being an efficient guideline for pathogenesis research.

[1]  Jing Li,et al.  Generating samples for association studies based on HapMap data , 2008, BMC Bioinformatics.

[2]  D. Allison,et al.  Detection of gene x gene interactions in genome-wide association studies of human population data. , 2007, Human heredity.

[3]  F. James Rohlf,et al.  Biometry: The Principles and Practice of Statistics in Biological Research , 1969 .

[4]  Todd Holden,et al.  A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. , 2006, Journal of theoretical biology.

[5]  Lester L. Peters,et al.  Genome-wide association study identifies novel breast cancer susceptibility loci , 2007, Nature.

[6]  Rui Mei,et al.  Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation , 2005, Human Genomics.

[7]  Sokal Rr,et al.  Biometry: the principles and practice of statistics in biological research 2nd edition. , 1981 .

[8]  Shyh-Huei Chen,et al.  A support vector machine approach for detecting gene‐gene interaction , 2008, Genetic epidemiology.

[9]  P. Donnelly,et al.  Genome-wide strategies for detecting multiple loci that influence complex diseases , 2005, Nature Genetics.

[10]  H. Cordell Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. , 2002, Human molecular genetics.

[11]  Toshihiro Tanaka The International HapMap Project , 2003, Nature.

[12]  J. Pritchard,et al.  Linkage disequilibrium in humans: models and data. , 2001, American journal of human genetics.

[13]  Daphne Koller,et al.  Toward Optimal Feature Selection , 1996, ICML.

[14]  Constantin F. Aliferis,et al.  Time and sample efficient discovery of Markov blankets and direct causal relations , 2003, KDD '03.

[15]  J. H. Moore,et al.  Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. , 2001, American journal of human genetics.

[16]  P. Spirtes,et al.  Causation, Prediction, and Search, 2nd Edition , 2001 .

[17]  M. Olivier A haplotype map of the human genome. , 2003, Nature.

[18]  Mee Young Park,et al.  Penalized logistic regression for detecting gene interactions. , 2008, Biostatistics.

[19]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[20]  Tom Burr,et al.  Causation, Prediction, and Search , 2003, Technometrics.

[21]  Jason H. Moore,et al.  Power of multifactor dimensionality reduction for detecting gene‐gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity , 2003, Genetic epidemiology.

[22]  Jason H. Moore,et al.  Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions , 2003, Bioinform..

[23]  Sebastian Thrun,et al.  Bayesian Network Induction via Local Neighborhoods , 1999, NIPS.

[24]  J. Ott,et al.  Complement Factor H Polymorphism in Age-Related Macular Degeneration , 2005, Science.

[25]  Stylianos E. Antonarakis,et al.  Mendelian disorders deserve more attention , 2006, Nature Reviews Genetics.

[26]  Jacques Fellay,et al.  A Whole-Genome Association Study of Major Determinants for Host Control of HIV-1 , 2007, Science.

[27]  M. Olivier A haplotype map of the human genome , 2003, Nature.

[28]  Alan F. Scott,et al.  Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders , 2002, Nucleic Acids Res..

[29]  Constantin F. Aliferis,et al.  HITON: A Novel Markov Blanket Algorithm for Optimal Variable Selection , 2003, AMIA.

[30]  Rui Jiang,et al.  A random forest approach to the detection of epistatic interactions in case-control studies , 2009, BMC Bioinformatics.

[31]  Marko Robnik-Sikonja,et al.  Theoretical and Empirical Analysis of ReliefF and RReliefF , 2003, Machine Learning.

[32]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[33]  Jun S. Liu,et al.  Bayesian inference of epistatic interactions in case-control studies , 2007, Nature Genetics.

[34]  David M. Reif,et al.  Machine Learning for Detecting Gene-Gene Interactions , 2006, Applied bioinformatics.

[35]  Ingo Ruczinski,et al.  Identifying interacting SNPs using Monte Carlo logic regression , 2005, Genetic epidemiology.

[36]  Alan F. Scott,et al.  Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders , 2004, Nucleic Acids Res..

[37]  Constantin F. Aliferis,et al.  Algorithms for Large Scale Markov Blanket Discovery , 2003, FLAIRS.

[38]  Xue-wen Chen,et al.  Improving Bayesian Network Structure Learning with Mutual Information-Based Node Ordering in the K2 Algorithm , 2008, IEEE Transactions on Knowledge and Data Engineering.