Forward regression for Cox models with high-dimensional covariates

Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively small. However, it is seldom used in high-dimensional settings because of its computational burden and unknown theoretical properties. Some recent works have shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, these results are based on the residual sum of squares from linear models, and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment in partial likelihood and stops according to an EBIC rule. To our knowledge, this is the first study that investigates partial likelihood-based forward regression in high-dimensional survival settings and establishes selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps, and that their order of entry is determined by the size of the increment in partial likelihood. As partial likelihood is not a regular density-based likelihood, we develop some new theoretical results on partial likelihood and use them to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and an analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients' survival.
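
The selection loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`partial_loglik`, `fit_cox`, `forward_cox_ebic`), the plain Newton-Raphson fit with step halving, the assumption of no tied event times, the simplified EBIC penalty |S| (log n + 2 γ log p), and the rule of stopping at the first step where EBIC fails to decrease are all assumptions made for illustration.

```python
import numpy as np

def partial_loglik(X, event, beta):
    """Cox log partial likelihood; rows sorted by decreasing time, no ties."""
    eta = X @ beta
    # With this ordering, the risk set of subject i is rows 0..i.
    return np.sum(event * (eta - np.log(np.cumsum(np.exp(eta)))))

def fit_cox(X, event, n_iter=50, tol=1e-8):
    """Maximise the partial likelihood by Newton-Raphson with step halving."""
    k = X.shape[1]
    beta = np.zeros(k)
    ll = partial_loglik(X, event, beta)
    xx = X[:, :, None] * X[:, None, :]
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        s0 = np.cumsum(w)
        xbar = np.cumsum(w[:, None] * X, axis=0) / s0[:, None]
        s2 = np.cumsum(w[:, None, None] * xx, axis=0) / s0[:, None, None]
        grad = (event[:, None] * (X - xbar)).sum(axis=0)
        cov = s2 - xbar[:, :, None] * xbar[:, None, :]
        info = (event[:, None, None] * cov).sum(axis=0)
        step = np.linalg.solve(info + 1e-8 * np.eye(k), grad)
        new_ll = partial_loglik(X, event, beta + step)
        while not (new_ll >= ll) and np.max(np.abs(step)) > tol:
            step = step / 2.0
            new_ll = partial_loglik(X, event, beta + step)
        beta = beta + step
        converged = abs(new_ll - ll) < tol
        ll = new_ll
        if converged:
            break
    return beta, ll

def forward_cox_ebic(X, time, event, gamma=1.0, max_steps=None):
    """Add the covariate with the largest partial-likelihood gain at each
    step; stop when the (simplified, assumed) EBIC no longer decreases."""
    order = np.argsort(-time)
    X, event = X[order], event[order].astype(float)
    n, p = X.shape
    max_steps = min(p, max_steps or p)
    selected = []
    # EBIC of the null model (no covariates, zero penalty).
    best_ebic = -2.0 * partial_loglik(X[:, :0], event, np.zeros(0))
    for _ in range(max_steps):
        candidates = [j for j in range(p) if j not in selected]
        fits = [(fit_cox(X[:, selected + [j]], event)[1], j) for j in candidates]
        ll, j_best = max(fits)
        size = len(selected) + 1
        ebic = -2.0 * ll + size * (np.log(n) + 2.0 * gamma * np.log(p))
        if ebic >= best_ebic:
            break
        selected.append(j_best)
        best_ebic = ebic
    return selected

# Hypothetical simulated data: covariates 0 and 1 are the only relevant ones.
rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[0, 1]] = [1.5, -1.5]
T = rng.exponential(1.0 / np.exp(X @ beta_true))   # event times
C = rng.exponential(2.0, size=n)                   # censoring times
time, event = np.minimum(T, C), T <= C
print(forward_cox_ebic(X, time, event))
```

Each step refits one Cox model per remaining candidate, so the cost grows with the number of covariates times the number of steps taken; the sketch favours readability over speed and does not handle tied event times.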
