VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA.

We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation (SCAD) and adaptive LASSO penalties and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. In particular, we propose a model selection criterion, called the IC(Q) statistic, for selecting the penalty parameters. We show that the variable selection procedure based on IC(Q) automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and to examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial are presented to illustrate the proposed methodology.
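For readers unfamiliar with the two penalties named above, the following is a minimal sketch of their standard textbook forms, not the paper's implementation. The function names, the default a = 3.7 (the value commonly suggested for SCAD), the exponent gamma, and the requirement that the initial estimates be nonzero are all assumptions made here for illustration; the missing-data likelihood and the IC(Q) criterion are not shown.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for one coefficient (hypothetical helper).

    Linear near zero, quadratic in the middle, constant for large |theta|,
    so large coefficients are not shrunk further. Requires lam > 0, a > 2.
    """
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def adaptive_lasso_penalty(beta, beta_init, lam, gamma=1.0):
    """Adaptive LASSO penalty: lam * sum_j |beta_j| / |beta_init_j|**gamma.

    Assumes every initial estimate in beta_init is nonzero.
    """
    return lam * sum(abs(b) / abs(b0) ** gamma for b, b0 in zip(beta, beta_init))


if __name__ == "__main__":
    # The SCAD penalty grows with |theta| and then levels off.
    for theta in (0.5, 2.0, 10.0):
        print(theta, scad_penalty(theta, lam=1.0))
    # Coefficients with small initial estimates are penalized more heavily.
    print(adaptive_lasso_penalty([1.0, -2.0], [2.0, 1.0], lam=0.5))
```

In a penalized likelihood fit, one of these terms would be subtracted from the log-likelihood (summed over coefficients in the SCAD case), with lam chosen by a selection criterion.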
