SelectBoost: a general algorithm to enhance the performance of variable selection methods

MOTIVATION: With the growth of big data, variable selection has become one of the critical challenges in statistics. Although many methods have been proposed in the literature, their performance in terms of recall (sensitivity) and precision (positive predictive value) is limited in settings where the number of variables far exceeds the number of observations, or where the predictors are highly correlated.

RESULTS: In this article, we propose a general algorithm that improves the precision of any existing variable selection method. The algorithm is based on highly intensive simulations and takes the correlation structure of the data into account. It can either produce a confidence index for variable selection or be used from an experimental-design planning perspective. We demonstrate the performance of our algorithm on both simulated and real data, and then apply it in two different ways to improve biological network reverse-engineering.

AVAILABILITY: Code is available as the SelectBoost package on CRAN, https://cran.r-project.org/package=SelectBoost. Some network reverse-engineering functionalities are available in the Patterns CRAN package, https://cran.r-project.org/package=Patterns.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
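To make the idea concrete, the R sketch below mimics the core loop described above: for each variable, the columns highly correlated with it are identified, that variable is repeatedly regenerated as fresh noise sharing roughly the same correlation with its group, the base selector is rerun on each perturbed design, and the fraction of runs in which a variable is still selected serves as its confidence index. This is a simplified illustration only, not the SelectBoost package's implementation (the published algorithm generates replacement columns more carefully, e.g. from a von Mises-Fisher distribution on the sphere, and the package exposes its own API, which is not reproduced here). The names selectboost_sketch and lasso_select are ours, and the lasso via glmnet stands in for an arbitrary base selection method.

```r
## Simplified illustration of the confidence-index idea, assuming the
## lasso (via glmnet) as the base variable selection method.
library(glmnet)

set.seed(1)
n <- 50; p <- 100
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.1 * rnorm(n)   # make columns 1 and 2 highly correlated
y <- X[, 1] - X[, 3] + rnorm(n)

# Base selector: indices of variables with nonzero lasso coefficients.
lasso_select <- function(X, y) {
  fit <- cv.glmnet(X, y)
  which(as.numeric(coef(fit, s = "lambda.min"))[-1] != 0)  # drop intercept
}

# Hypothetical sketch: regenerate each variable that has highly correlated
# partners (|cor| >= c0), rerun the selector B times, count selections.
selectboost_sketch <- function(X, y, B = 50, c0 = 0.9) {
  p <- ncol(X)
  R <- abs(cor(X))
  counts <- numeric(p)
  for (b in seq_len(B)) {
    Xb <- X
    for (j in seq_len(p)) {
      grp <- setdiff(which(R[j, ] >= c0), j)  # correlated partners of j
      if (length(grp) > 0) {
        # Replace column j with fresh noise keeping roughly its correlation
        # with the group; the real algorithm draws replacement columns from
        # a von Mises-Fisher distribution instead.
        anchor <- as.numeric(scale(rowMeans(X[, grp, drop = FALSE])))
        rho <- mean(R[j, grp])
        Xb[, j] <- rho * anchor + sqrt(1 - rho^2) * rnorm(nrow(X))
      }
    }
    sel <- lasso_select(Xb, y)
    counts[sel] <- counts[sel] + 1
  }
  counts / B  # per-variable confidence index in [0, 1]
}

conf <- selectboost_sketch(X, y)
order(conf, decreasing = TRUE)[1:5]  # most confidently selected variables
```

In this toy setting, a variable whose selection survives the repeated regeneration of its correlated partners (here, columns 1 and 2) earns a confidence index near 1, while a variable selected only because it proxies for a correlated partner sees its index drop, which is the precision gain the abstract describes.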
