Capturing the Spectrum of Interaction Effects in Genetic Association Studies by Simulated Evaporative Cooling Network Analysis

Evidence from human genetic studies of several disorders suggests that interactions between alleles at multiple genes play an important role in influencing phenotypic expression. Analytical methods designed to identify Mendelian disease genes are poorly suited to common multigenic diseases, because they test association with the phenotype only one genetic locus at a time. New strategies are needed that can capture the spectrum of genetic effects, from Mendelian to multifactorial epistasis. Random Forests (RF) and Relief-F are two powerful machine-learning methods that have been studied as filters for genetic case-control data because they account for the context of alleles at multiple genes when scoring the relevance of individual genetic variants to the phenotype. However, when variants interact strongly, the independence assumption in the RF tree node-splitting criterion leads to diminished importance scores for relevant variants. Relief-F, by contrast, was designed to detect strong interactions but is sensitive to large backgrounds of variants that are irrelevant to classifying the phenotype, which is an acute problem in genome-wide association studies. To overcome the weaknesses of these data mining approaches, we develop Evaporative Cooling (EC) feature selection, a flexible machine-learning method that can integrate multiple importance scores while removing irrelevant genetic variants. To characterize detailed interactions, we construct a genetic-association interaction network (GAIN), whose edges quantify the synergy between variants with respect to the phenotype. We use simulation analysis to show that EC is able to identify a wide range of interaction effects in genetic association data. We apply the EC filter to a smallpox vaccine cohort study of single nucleotide polymorphisms (SNPs) and infer a GAIN for a collection of SNPs associated with adverse events. Our results suggest an important role for hubs in SNP disease susceptibility networks. The software is available at http://sites.google.com/site/McKinneyLab/software.
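To make the idea of an edge that quantifies synergy concrete, the following is a minimal sketch that scores a pair of variants by interaction information: the information the pair carries about the phenotype jointly, minus what each variant carries alone. The abstract does not state the exact edge measure, so this choice is an assumption, and the function names (`entropy`, `mutual_information`, `interaction_information`) and the toy genotype data are hypothetical illustrations rather than part of the released software.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_information(snp_a, snp_b, phenotype):
    """Assumed synergy score: I(A,B;C) - I(A;C) - I(B;C).

    Positive values mean the pair is more informative about the
    phenotype together than the sum of its parts (synergy);
    negative values indicate redundancy.
    """
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))


# Hypothetical toy data: a pure XOR interaction with no main effects.
snp_a = [0, 0, 1, 1] * 5
snp_b = [0, 1, 0, 1] * 5
status = [a ^ b for a, b in zip(snp_a, snp_b)]

print(interaction_information(snp_a, snp_b, status))
```

In this toy case neither variant is informative alone, so a one-locus-at-a-time test would miss both, while the pairwise score is positive. Computing such a score for every pair of filtered variants would give a weighted graph in which highly connected variants appear as hubs.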
