False discovery control for penalized variable selections with high-dimensional covariates

Abstract: Modern biotechnologies have produced vast amounts of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, controlling false discoveries (i.e., the inclusion of irrelevant variables) in penalized high-dimensional variable selection presents serious challenges. To control the fraction of false discoveries effectively, we propose a false discovery controlling procedure. The proposed method is general and flexible: it can work with a broad class of variable selection algorithms, not only for linear regression but also for generalized linear models and survival analysis.
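To make the quantity being controlled concrete, the following minimal sketch computes the fraction of false discoveries for a single selected variable set. It is only an illustration of the definition, not the proposed procedure: the function name `false_discovery_proportion`, the index lists, and the convention of returning 0 when nothing is selected are all assumptions introduced here, and in real data the true support is unknown.

```python
def false_discovery_proportion(selected, true_support):
    """Fraction of selected variables that are not truly relevant.

    Both arguments are iterables of variable indices. This helper is
    hypothetical and only illustrates the definition of the quantity.
    """
    selected = set(selected)
    true_support = set(true_support)
    if not selected:
        # Assumed convention: no selections means no false discoveries.
        return 0.0
    false_discoveries = selected - true_support
    return len(false_discoveries) / len(selected)


# Made-up example: a selection algorithm picks five variables,
# of which two (12 and 25) are irrelevant.
selected = [0, 3, 7, 12, 25]
true_support = [0, 3, 7, 9]
fdp = false_discovery_proportion(selected, true_support)
print(fdp)
```

In a simulation, where the relevant variables are known by construction, averaging this proportion over repeated data sets gives an empirical estimate of the false discovery rate of a selection algorithm.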
