Conditional Sure Independence Screening

ABSTRACT Independence screening is powerful for variable selection when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or their variants. When prior knowledge of a certain important set of variables is available, a natural assessment of the relative importance of the other predictors is their conditional contribution to the response given the known set of variables. This leads to conditional sure independence screening (CSIS). CSIS produces a rich family of alternative screening methods through different choices of the conditioning set and can help reduce the number of false-positive and false-negative selections when covariates are highly correlated. This article proposes and studies CSIS in generalized linear models. We give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situations under which CSIS yields model selection consistency, and describe the properties of CSIS when a data-driven conditioning set is used. Moreover, we provide two data-driven methods for selecting the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and the analysis of two real datasets. Supplementary materials for this article are available online.
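To make the screening idea concrete, the following is a minimal sketch of conditional screening in the linear-model special case: for each candidate predictor outside the conditioning set, the response is regressed on the known variables plus that candidate, and candidates are ranked by the absolute fitted coefficient on the added variable (its conditional contribution). The paper treats general GLMs via conditional marginal likelihood and studies data-driven thresholds; the function name `csis_rank`, the least-squares fit, and the toy data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def csis_rank(X, y, cond_idx, n_keep):
    """Sketch of conditional screening (linear-model special case).

    For each candidate j outside the conditioning set, fit y on an intercept,
    the conditioning variables, and X_j, and score j by the absolute fitted
    coefficient on X_j. Return the n_keep highest-scoring candidates.
    """
    n, p = X.shape
    cond_idx = list(cond_idx)
    candidates = [j for j in range(p) if j not in cond_idx]
    Xc = X[:, cond_idx]
    scores = {}
    for j in candidates:
        design = np.column_stack([np.ones(n), Xc, X[:, j]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        scores[j] = abs(beta[-1])  # conditional contribution of candidate j
    ranked = sorted(candidates, key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]

# Toy usage: 200 observations, 1000 predictors, conditioning set {0, 1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))
y = 2 * X[:, 0] - X[:, 1] + 1.5 * X[:, 7] + rng.standard_normal(200)
print(csis_rank(X, y, cond_idx=[0, 1], n_keep=10))  # variable 7 should rank near the top
```

In practice the number of retained variables (or equivalently the threshold on the screening statistic) would be chosen by a data-driven rule such as those proposed in the article, rather than fixed in advance as in this toy example.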
