A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility.

Detecting, characterizing, and interpreting gene-gene interactions, or epistasis, in studies of human disease susceptibility is both a mathematical and a computational challenge. To address this problem, we have previously developed a multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension (i.e., constructive induction), thus permitting interactions to be detected in relatively small samples. In this paper, we describe a comprehensive and flexible framework for detecting and interpreting gene-gene interactions that uses advances in information theory for selecting interesting single-nucleotide polymorphisms (SNPs), MDR for constructive induction, machine learning methods for classification, and graphical models for interpretation. We illustrate the usefulness of this strategy using artificial datasets simulated from several different two-locus and three-locus epistasis models. We show that the accuracy, sensitivity, specificity, and precision of a naïve Bayes classifier are significantly improved when SNPs are selected based on their information gain (i.e., class entropy removed) and reduced to a single attribute using MDR. We then apply this strategy to detecting, characterizing, and interpreting epistatic models in a genetic study (n = 500) of atrial fibrillation and show that both classification and model interpretation are significantly improved.
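To make the two central quantities concrete, the following is a minimal sketch, not the authors' implementation, of information gain (class entropy removed by an attribute) and of an MDR-style collapse of two SNPs into a single high-risk/low-risk attribute. The function names, the toy genotype coding, and the use of the overall case:control ratio as the risk threshold are assumptions made for illustration. The sketch omits the cross-validation that MDR uses for model selection.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(attribute, labels):
    """Class entropy removed by conditioning on one discrete attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def mdr_attribute(snp_a, snp_b, labels):
    """Collapse two SNPs into one binary attribute (1 = high risk, 0 = low risk).

    A two-locus genotype cell is called high risk when its case:control
    ratio exceeds the overall case:control ratio (an assumed threshold).
    Labels are coded 1 for cases and 0 for controls.
    """
    cases, controls = Counter(), Counter()
    for a, b, y in zip(snp_a, snp_b, labels):
        (cases if y == 1 else controls)[(a, b)] += 1
    threshold = sum(cases.values()) / max(sum(controls.values()), 1)
    cells = set(cases) | set(controls)
    high_risk = {c for c in cells if cases[c] > threshold * controls[c]}
    return [1 if (a, b) in high_risk else 0 for a, b in zip(snp_a, snp_b)]


# Toy, purely epistatic (XOR-like) example: neither SNP is informative alone.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
status = [0, 1, 1, 0]

collapsed = mdr_attribute(snp_a, snp_b, status)
print(information_gain(snp_a, status))      # gain of a single SNP
print(information_gain(collapsed, status))  # gain of the constructed attribute
```

In this toy dataset each SNP alone removes no class entropy, whereas the constructed attribute removes all of it, which is the situation in which constructive induction would be expected to help a classifier such as naïve Bayes that assumes conditionally independent attributes.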
