Resampling-based tests for Lasso in genome-wide association studies

Background: Genome-wide association studies (GWAS) test for association between millions of genetic variants and a trait, typically using univariate regression to test each variant against the phenotype one at a time. Alternatively, Lasso penalized regression allows one to jointly model the relationship between all genetic variants and the phenotype. However, it is unclear how best to conduct inference on the individual Lasso coefficients, especially in high-dimensional settings.

Methods: We consider six methods for testing the Lasso coefficients: three that select the penalty parameter to control the type I error rate, two by permutation (Lasso-Ayers, Lasso-PL) and one analytically (Lasso-AL); a residual bootstrap (Lasso-RB); a modified residual bootstrap (Lasso-MRB); and a permutation test (Lasso-PT). The methods are compared via simulations and an application to data from the Minnesota Center for Twin and Family Study.

Results: We show that, for finite sample sizes and an increasing number of null predictors, Lasso-RB, Lasso-MRB, and Lasso-PT fail to be viable methods of inference. However, Lasso-PL and Lasso-AL remain fast and powerful tools for conducting inference with the Lasso, even in high dimensions.

Conclusion: Our results suggest that the proposed permutation selection procedure (Lasso-PL) and the analytic selection method (Lasso-AL) are fast and powerful alternatives to the standard univariate analysis in genome-wide association studies.
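The abstract does not spell out how the permutation-based penalty selection works, so the following is only a generic sketch of the idea, not the paper's Lasso-PL or Lasso-Ayers procedure. It assumes the Lasso objective (1/(2n))·||y − Xb||² + λ·||b||₁ with standardized predictors and a centered phenotype, in which case the smallest penalty that leaves every coefficient at zero is max_j |x_jᵀy| / n. Permuting the phenotype breaks any association, so an upper quantile of that value across permutations gives a penalty under which a null data set would rarely select any variant. All function names, the simulated data, and the default settings are hypothetical.

```python
import random


def standardize(col):
    """Center a predictor column and scale it to unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]


def lambda_max(columns, y):
    """Smallest penalty at which the Lasso solution is all zeros.

    Assumes the objective (1/(2n))*||y - Xb||^2 + lam*||b||_1 with
    standardized columns; y is centered here.
    """
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    return max(abs(sum(x * v for x, v in zip(col, yc))) / n for col in columns)


def permutation_lambda(columns, y, n_perm=200, alpha=0.05, seed=0):
    """Approximate (1 - alpha) quantile of lambda_max over permuted phenotypes."""
    rng = random.Random(seed)
    null_lams = []
    for _ in range(n_perm):
        y_perm = y[:]          # copy so the caller's phenotype is untouched
        rng.shuffle(y_perm)    # permuting y removes any real association
        null_lams.append(lambda_max(columns, y_perm))
    null_lams.sort()
    idx = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null_lams[idx]


# Hypothetical toy data: 15 null variants coded 0/1/2 and a Gaussian phenotype.
rng = random.Random(1)
n, p = 60, 15
X_cols = [standardize([rng.choice([0, 1, 2]) for _ in range(n)]) for _ in range(p)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

lam = permutation_lambda(X_cols, y, n_perm=100)
observed = lambda_max(X_cols, y)
print("permutation-selected penalty:", lam)
print("penalty at which the first variant enters:", observed)
```

In this sketch, a Lasso fit at the penalty `lam` selects at least one variant only when `observed` exceeds `lam`, which under a global null should happen in roughly a fraction `alpha` of data sets. Fitting the Lasso itself at that penalty would need a solver such as coordinate descent, which is omitted here.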
