An iterative approach to distance correlation-based sure independence screening

Feature screening and variable selection are fundamental to the analysis of ultrahigh-dimensional data, which are now collected in diverse scientific fields at relatively low cost. Distance correlation-based sure independence screening (DC-SIS) was proposed to perform feature screening for ultrahigh-dimensional data. DC-SIS possesses the sure screening property and filters out unimportant predictors in a model-free manner. Like all independence screening methods, however, it can fail to detect truly important predictors that are marginally independent of the response variable because of correlations among predictors. When many irrelevant predictors are highly correlated with some strongly active predictors, independence screening may also miss other active predictors with relatively weak marginal signals. To improve the performance of DC-SIS, we introduce an effective iterative procedure based on distance correlation that detects all truly important predictors, and potential interactions, in both linear and nonlinear models. The proposed iterative method thus retains the favourable model-free and robust properties of DC-SIS. We further illustrate its excellent finite-sample performance through comprehensive simulation studies and an empirical analysis of the rat eye expression data set.
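The marginal screening step underlying DC-SIS can be sketched as follows. This is a minimal illustration, not the paper's method: `distance_correlation` implements the standard sample distance correlation of Székely, Rizzo and Bakirov for univariate variables via double-centred distance matrices, and `dcsis_rank` performs only the non-iterative marginal ranking; the function names are illustrative, and the iterative refinement proposed in the paper is not reproduced here.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two univariate samples."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    # Pairwise Euclidean distance matrices
    a = np.abs(x - x.T)
    b = np.abs(y - y.T)
    # Double-centring: subtract row and column means, add the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()          # squared sample distance covariance
    dvar_x = (A * A).mean()         # squared sample distance variance of x
    dvar_y = (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def dcsis_rank(X, y):
    """Rank predictor columns of X by marginal distance correlation with y
    (the non-iterative DC-SIS screening step), strongest first."""
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1]
```

Because distance correlation is zero if and only if the variables are independent, this ranking can pick up nonlinear marginal signals (e.g. a quadratic relationship) that Pearson correlation misses, which is what makes the procedure model-free.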
