The use of vector bootstrapping to improve variable selection precision in Lasso models

Abstract: The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. A Lasso model is commonly fitted using K-fold cross-validation to choose the penalty. This is sometimes followed by bootstrap confidence intervals to improve the precision of the resulting variable selections. Nesting cross-validation within bootstrapping could improve precision further, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping. Data were simulated to represent genomic data under a polygenic model and under a model with effect sizes representative of typical GWAS results. We compared these approaches with each other and with software defaults for the Lasso. Nested cross-validation gave the most precise variable selection at small effect sizes. At larger effect sizes, nesting offered no advantage. We illustrated the nested approach with empirical data comprising SNPs and SNP-SNP interactions from the most significant SNPs in a GWAS of borderline personality symptoms. In the empirical example, the default Lasso selected low-reliability SNPs and interactions, which bootstrapping excluded.
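The nested procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's `LassoCV` as the cross-validated Lasso, a percentile bootstrap interval, and a selection rule that keeps a variable when its interval excludes zero. The function name `bootstrap_lasso`, its parameters, and the simulated data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def bootstrap_lasso(X, y, n_boot=50, n_folds=5, level=0.95, seed=0):
    """Vector (case-resampling) bootstrap with K-fold CV nested in each replicate.

    Hypothetical helper: returns a boolean selection mask plus the lower and
    upper percentile bounds for every coefficient.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.empty((n_boot, p))
    for b in range(n_boot):
        # Resample whole observations (x_i, y_i) with replacement.
        idx = rng.integers(0, n, size=n)
        # Nested step: the penalty is re-tuned by CV on every bootstrap sample.
        model = LassoCV(cv=n_folds).fit(X[idx], y[idx])
        coefs[b] = model.coef_
    alpha = 1.0 - level
    lower = np.quantile(coefs, alpha / 2, axis=0)
    upper = np.quantile(coefs, 1.0 - alpha / 2, axis=0)
    # Assumed rule: keep a variable if its percentile interval excludes zero.
    selected = (lower > 0) | (upper < 0)
    return selected, lower, upper


# Small simulated example (assumed effect sizes, not those of the study).
rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]
y = X @ beta + rng.standard_normal(n)

selected, lower, upper = bootstrap_lasso(X, y)
print("selected variables:", np.flatnonzero(selected))
```

The non-nested alternative would tune the penalty once on the full data and reuse that fixed value in every bootstrap replicate; in this sketch the only difference is that `LassoCV` is refitted inside the loop.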
