False Discovery Rate Control via Debiased Lasso

We consider the problem of variable selection in high-dimensional statistical models, where the goal is to report a set of variables, out of many predictors $X_1, \dotsc, X_p$, that are relevant to a response of interest. For the high-dimensional linear model, where the number of parameters exceeds the number of samples $(p>n)$, we propose a procedure for variable selection and prove that it controls the \emph{directional} false discovery rate (FDR) below a pre-assigned significance level $q\in [0,1]$. We further analyze the statistical power of our framework and show that, for designs with subgaussian rows and a common precision matrix $\Omega\in\mathbb{R}^{p\times p}$, if the minimum nonzero parameter $\theta_{\min}$ satisfies $$\sqrt{n} \theta_{\min} - \sigma \sqrt{2(\max_{i\in [p]}\Omega_{ii})\log\left(\frac{2p}{qs_0}\right)} \to \infty\,,$$ then the procedure achieves asymptotic power one. Our framework is built upon the debiasing approach and assumes the standard condition $s_0 = o(\sqrt{n}/(\log p)^2)$, where $s_0$ indicates the number of true positives among the $p$ features. Notably, this framework achieves exact directional FDR control without any assumption on the amplitude of the unknown regression parameters, and does not require any knowledge of the distribution of the covariates or the noise level. We test our method in synthetic and real data experiments to assess its performance and to corroborate our theoretical results.
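To get a feel for the power condition above, the following minimal sketch evaluates the quantity $\sqrt{n}\,\theta_{\min} - \sigma \sqrt{2(\max_i \Omega_{ii})\log(2p/(qs_0))}$ for given problem sizes, together with the signal scale that $\theta_{\min}$ must exceed for that quantity to be positive. The function names and the example values are illustrative assumptions, not part of the paper, and the condition itself is asymptotic: a positive gap at finite $n$ does not by itself guarantee power one.

```python
import math


def power_condition_gap(n, p, s0, q, sigma, omega_diag_max, theta_min):
    """Evaluate sqrt(n)*theta_min - sigma*sqrt(2*max_i(Omega_ii)*log(2p/(q*s0))).

    Hypothetical helper: the abstract states that this quantity must
    diverge to infinity for asymptotic power one.
    """
    if not (0 < q <= 1):
        raise ValueError("q must lie in (0, 1] for the logarithm to be usable")
    if s0 <= 0 or n <= 0 or p <= 0:
        raise ValueError("n, p and s0 must be positive")
    penalty = sigma * math.sqrt(2.0 * omega_diag_max * math.log(2.0 * p / (q * s0)))
    return math.sqrt(n) * theta_min - penalty


def signal_scale(n, p, s0, q, sigma, omega_diag_max):
    """Value of theta_min at which the gap above is exactly zero."""
    # Setting theta_min = 0 leaves only the (negated) penalty term.
    penalty = -power_condition_gap(n, p, s0, q, sigma, omega_diag_max, theta_min=0.0)
    return penalty / math.sqrt(n)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    scale = signal_scale(n=1000, p=2000, s0=10, q=0.1, sigma=1.0, omega_diag_max=1.0)
    print(f"theta_min must be well above {scale:.4f}")
```

The scale shrinks like $1/\sqrt{n}$ and grows only logarithmically in $p/(qs_0)$, which is why the condition is mild when $n$ is large relative to $\log p$.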

[58]  Gaorong Li,et al.  RANK: Large-Scale Inference With Graphical Nonlinear Knockoffs , 2017, Journal of the American Statistical Association.