Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models

In this paper, we study model selection and structure specification for generalised semi-varying coefficient models (GSVCMs), where the number of potential covariates is allowed to be larger than the sample size. We first propose a penalised likelihood method with the LASSO penalty function to obtain preliminary estimates of the functional coefficients. Then, using a quadratic approximation of the local log-likelihood function and the adaptive group LASSO penalty (or the local linear approximation of the group SCAD penalty), with weights built from the preliminary estimates of the functional coefficients, we introduce a novel penalised weighted least squares procedure that selects the significant covariates and identifies the constant coefficients among the coefficients of the selected covariates, thereby specifying the semiparametric modelling structure. The developed model selection and structure specification approach not only inherits many nice statistical properties from local maximum likelihood estimation and the nonconcave penalised likelihood method, but is also computationally attractive thanks to the algorithm proposed to implement it. Under some mild conditions, we establish asymptotic properties of the proposed model selection and estimation procedure, such as sparsity and the oracle property. We also conduct simulation studies to examine the finite sample performance of the proposed method, and finally apply the method to analyse a real data set, which leads to some interesting findings.
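To make the second stage more concrete, the following is a minimal sketch of an adaptive group LASSO applied to a weighted least squares criterion, solved by proximal gradient descent. It is not the algorithm of the paper: the kernel-localised likelihood, the quadratic approximation, and the step that identifies constant coefficients are all omitted. The function names, the `groups` layout (one group per covariate), the penalty weights taken as the inverse norm of a preliminary estimate, and the tuning parameter `lam` are assumptions made for illustration only.

```python
import numpy as np

def group_soft_threshold(v, thresh):
    """Shrink the vector v towards zero as a block; return zeros if its norm is <= thresh."""
    norm = np.linalg.norm(v)
    if norm <= thresh:
        return np.zeros_like(v)
    return (1.0 - thresh / norm) * v

def adaptive_group_lasso(X, y, groups, prelim, lam, obs_weights=None, n_iter=500):
    """Hypothetical helper: minimise
        (1 / 2n) * sum_i w_i * (y_i - x_i' beta)^2 + lam * sum_g ||beta_g|| / ||prelim_g||
    by proximal gradient descent. `prelim` is a preliminary estimate (e.g. from a
    LASSO or least squares fit) used only to build the adaptive penalty weights."""
    n, p = X.shape
    w = np.ones(n) if obs_weights is None else np.asarray(obs_weights, dtype=float)
    # Step size from the Lipschitz constant of the smooth part's gradient.
    lipschitz = np.linalg.norm(X.T @ (w[:, None] * X), 2) / n
    step = 1.0 / lipschitz
    # Adaptive weights: groups with small preliminary estimates are penalised more.
    pen = [1.0 / max(np.linalg.norm(prelim[g]), 1e-8) for g in groups]
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (w * (y - X @ beta)) / n
        z = beta - step * grad
        for g, pw in zip(groups, pen):
            beta[g] = group_soft_threshold(z[g], step * lam * pw)
    return beta
```

In this sketch a covariate is dropped when its whole coefficient block is shrunk exactly to zero, which mirrors the group-wise selection described above; `groups` would be a list of index lists such as `[[0, 1], [2, 3]]`.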
