Consistent Estimation of Generalized Linear Models with High Dimensional Predictors via Stepwise Regression

Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and to explain the effects of the selected predictors, but their estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after variable screening methods have been applied to reduce the number of variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure yields a final model, controls both false negatives and false positives, and produces consistent estimates, which are useful for gauging the actual effect sizes of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
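
To make the idea of a stepwise fit concrete, the following is a minimal sketch of forward selection for a logistic GLM, not the procedure proposed in the paper. The abstract does not state the selection criterion or the stopping rule, so the sketch assumes an EBIC-style stopping rule (with the common simplification of penalizing each added variable by 2γ log p) and omits any backward elimination stage. The function names `fit_logistic`, `ebic`, and `forward_stepwise` are hypothetical and introduced only for illustration.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit a logistic regression by Newton-Raphson; return (beta, log-likelihood)."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)          # guard against overflow
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        # A tiny ridge term keeps the Hessian invertible (an assumption of this sketch).
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(k)
        step = np.linalg.solve(hess, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    eta = np.clip(X @ beta, -30, 30)
    loglik = float(np.sum(y * eta - np.log1p(np.exp(eta))))
    return beta, loglik


def ebic(loglik, k, n, p, gamma=1.0):
    """EBIC-style score (assumed criterion): -2 loglik + k log n + 2 gamma k log p."""
    return -2.0 * loglik + k * np.log(n) + 2.0 * gamma * k * np.log(p)


def forward_stepwise(X, y, max_steps=10, gamma=1.0):
    """Greedy forward selection; stop when the EBIC-style score no longer improves."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    _, ll = fit_logistic(intercept, y)
    best_score = ebic(ll, 0, n, p, gamma)
    for _ in range(max_steps):
        best_j, best_ll = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            Z = np.hstack([intercept, X[:, selected + [j]]])
            _, ll_j = fit_logistic(Z, y)
            if ll_j > best_ll:
                best_j, best_ll = j, ll_j
        if best_j is None:
            break
        score = ebic(best_ll, len(selected) + 1, n, p, gamma)
        if score >= best_score:
            break                                  # no improvement: stop adding
        selected.append(best_j)
        best_score = score
    # Refit the final model without any penalty, so coefficients are not shrunk.
    beta, _ = fit_logistic(np.hstack([intercept, X[:, selected]]), y)
    return selected, beta
```

Calling `forward_stepwise(X, y)` on an `n × p` design matrix and a 0/1 response returns the indices of the selected columns and the refitted coefficients (intercept first). Because the final model is refitted by plain maximum likelihood on the selected variables, its coefficients are not shrunk toward zero the way LASSO estimates are, which is the contrast the abstract draws.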
