Statistical Applications in Genetics and Molecular Biology

A Comparison of Multifactor Dimensionality Reduction and L1-Penalized Regression to Identify Gene-Gene Interactions in Genetic

The volume of high-dimensional data has grown rapidly in recent years, creating new analytical challenges for human genetics. Furthermore, much evidence suggests that common complex diseases may arise from complex etiologies such as gene-gene interactions, which are difficult to identify in high-dimensional data using traditional statistical approaches. Data-mining approaches are gaining popularity for variable selection in association studies, and one of the most commonly used methods for evaluating potential gene-gene interactions is Multifactor Dimensionality Reduction (MDR). In addition, penalized regression techniques such as the lasso are gaining popularity within the statistical community and are now being applied to association studies, including extensions for interactions. In this study, we compare the performance of MDR, the traditional lasso with L1 penalty (TL1), and the group lasso for categorical data with group-wise L1 penalty (GL1) in detecting gene-gene interactions across a broad range of simulations. We find that each method has both advantages and disadvantages, and that relative performance is context dependent. TL1 frequently over-fits, identifying false-positive as well as true-positive loci. MDR has higher power for epistatic models that exhibit independent main effects; for both lasso methods, main effects tend to dominate. For purely epistatic models, GL1 performs best at lower minor allele frequencies, whereas MDR performs best at higher frequencies. These results provide guidance on when each approach is best suited to detecting and characterizing interactions with different mechanisms.
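To make concrete what MDR evaluates for a candidate pair of loci, the following is a minimal sketch of its core step only: each two-locus genotype cell is labeled high or low risk by comparing its case:control ratio with the overall ratio, and the resulting rule is scored by balanced accuracy. It is not the implementation used in the study. The function names, the 0/1/2 genotype coding, and the toy data are assumptions made for illustration, and the cross-validation and permutation testing that a full MDR analysis relies on are omitted.

```python
from collections import Counter
from itertools import combinations, product


def mdr_pair_score(genotypes, status, i, j):
    """Balanced accuracy of the high/low-risk rule built from SNPs i and j.

    genotypes: list of rows, one per subject, coded 0/1/2 (assumed coding).
    status: list of 0 (control) / 1 (case).
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio

    # Count cases and controls in each two-locus genotype cell.
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, status):
        cell = (row[i], row[j])
        (cases if y == 1 else controls)[cell] += 1

    # A cell is high risk when its case:control ratio exceeds the threshold.
    high_risk = {c for c in set(cases) | set(controls)
                 if cases[c] > threshold * controls[c]}

    sensitivity = sum(cases[c] for c in high_risk) / n_cases
    specificity = sum(n for c, n in controls.items()
                      if c not in high_risk) / n_controls
    return 0.5 * (sensitivity + specificity)


def best_pair(genotypes, status):
    """Exhaustively score all SNP pairs and return the best-scoring one."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), 2),
               key=lambda p: mdr_pair_score(genotypes, status, *p))


# Toy, purely illustrative data: three SNPs, one subject per genotype
# combination; disease depends only on an XOR-like pattern of SNPs 0 and 1.
genotypes = [list(g) for g in product(range(3), repeat=3)]
status = [int((g[0] == 1) != (g[1] == 1)) for g in genotypes]

print(best_pair(genotypes, status))
```

On this toy data, SNPs 0 and 1 carry the interaction and SNP 2 is noise, so the exhaustive search should prefer the pair (0, 1). Scoring held-out folds rather than the training data, as MDR does in practice, would guard against the over-fitting this in-sample score invites.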
