A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction

Multifactor dimensionality reduction (MDR) was developed as a nonparametric, model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility. MDR uses a constructive induction algorithm to change the representation of the data so that nonadditive interactions become easier to detect with any classification method, such as naïve Bayes or logistic regression. Traditionally, MDR-constructed variables have been evaluated with a naïve Bayes classifier combined with 10-fold cross-validation to estimate the predictive accuracy, or generalizability, of epistasis models, and the statistical significance of the resulting models has been assessed by permutation testing. The advantage of permutation testing is that it controls for false positives due to multiple testing. The disadvantage is that it is computationally expensive, which becomes an important issue when detecting epistasis on a genome-wide scale. The goal of the present study was to develop and evaluate several alternatives to large-scale permutation testing for assessing the statistical significance of MDR models. Using data simulated from 70 different epistasis models, we compared the power and type I error rate of MDR assessed with a 1,000-fold permutation test against those of hypothesis testing based on an extreme value distribution (EVD). We found that this new hypothesis testing method provides a reasonable alternative to the computationally expensive 1,000-fold permutation test and is 50 times faster. We then demonstrate the new method by applying it to a genetic epidemiology study of bladder cancer susceptibility that was previously analyzed using MDR and assessed using a 1,000-fold permutation test. Genet. Epidemiol. 2008. © 2008 Wiley-Liss, Inc.
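The idea of replacing a large permutation test with an EVD-based test can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the EVD is fitted, so the sketch assumes a Gumbel (type I extreme value) distribution fitted by the method of moments to a small set of null statistics. The function `simulated_null_statistic`, the choice of 20 null replicates, and the observed accuracy of 0.62 are all hypothetical stand-ins for "best MDR accuracy on one permuted dataset" and a real MDR result.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(values):
    """Method-of-moments fit of a Gumbel distribution (an assumed EVD form)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    scale = sd * math.sqrt(6) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def gumbel_p_value(observed, loc, scale):
    """Upper-tail probability P(X >= observed) under the fitted Gumbel."""
    z = (observed - loc) / scale
    return 1.0 - math.exp(-math.exp(-z))


def empirical_p_value(observed, null_values):
    """Standard permutation p-value with the usual +1 correction."""
    hits = sum(1 for v in null_values if v >= observed)
    return (hits + 1) / (len(null_values) + 1)


def simulated_null_statistic(rng, n_models=50, n_samples=200):
    """Hypothetical stand-in for the best MDR accuracy on one permuted
    dataset: the maximum chance-level accuracy over n_models candidates."""
    best = 0.0
    for _ in range(n_models):
        correct = sum(rng.random() < 0.5 for _ in range(n_samples))
        best = max(best, correct / n_samples)
    return best


rng = random.Random(0)

# Small null sample (assumed size: 20 replicates) used to fit the EVD.
small_null = [simulated_null_statistic(rng) for _ in range(20)]
loc, scale = fit_gumbel_moments(small_null)

observed = 0.62  # hypothetical best accuracy on the real data
p_evd = gumbel_p_value(observed, loc, scale)
p_perm = empirical_p_value(observed, small_null)

print(f"Gumbel fit: loc={loc:.4f}, scale={scale:.4f}")
print(f"EVD p-value: {p_evd:.4g}; empirical p-value (20 replicates): {p_perm:.4g}")
```

With only 20 replicates the empirical p-value cannot fall below 1/21, whereas the fitted distribution extrapolates into the tail. That is the kind of saving an EVD approach trades against the accuracy of the distributional assumption.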
