Covariate selection for multilevel models with missing data

Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly handle missingness in the predictors through list-wise deletion or stepwise selection, both of which are problematic. Moreover, most variable selection methods were developed for linear regression models with independent observations and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that performs covariate selection across multiply-imputed data sets for multilevel random effects models with missing covariate data. Specifically, we propose to stack the data sets produced by a multiple imputation procedure and to apply group lasso regularization, a group variable selection procedure, to assess the overall impact of each predictor on the outcome across the imputed data sets. In simulations, the proposed method performs favorably compared with competing methods. We apply the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population.
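The stacking idea can be illustrated with a small sketch. It rests on several assumptions that the abstract does not confirm: each imputed data set gets its own coefficient vector, the coefficients of one predictor across all imputed data sets form one group, and the group lasso is solved by plain proximal gradient descent. The function name `group_lasso_stacked`, the penalty value, and the hot-deck fill-in (a crude stand-in for a proper multiple imputation procedure) are all hypothetical. Random effects are omitted entirely, so this shows only the fixed-effects selection step, not the multilevel method itself.

```python
import numpy as np


def group_lasso_stacked(X_list, y_list, lam, n_iter=500):
    """Group lasso over M imputed data sets via proximal gradient descent.

    X_list: M arrays of shape (n, p), one per imputed data set.
    y_list: M arrays of shape (n,), the outcome for each data set.
    Returns B with shape (M, p). Column j of B is one group, so
    predictor j is kept or dropped jointly across all imputations.
    """
    M = len(X_list)
    n, p = X_list[0].shape
    B = np.zeros((M, p))
    # Step size from the largest Lipschitz constant among the M loss terms.
    lipschitz = max(np.linalg.norm(X, 2) ** 2 / n for X in X_list)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        # Gradient of (1 / 2n) * ||y - X b||^2 for each imputed data set.
        G = np.stack([X.T @ (X @ b - y) / n
                      for X, y, b in zip(X_list, y_list, B)])
        Z = B - step * G
        # Group soft-thresholding: shrink each column of Z toward zero.
        norms = np.linalg.norm(Z, axis=0)
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        B = Z * shrink
    return B


# Toy data (assumed for illustration): only predictors 0 and 1 matter.
rng = np.random.default_rng(0)
n, p, M = 500, 5, 3
X_full = rng.standard_normal((n, p))
y = 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.standard_normal(n)

# Remove about 20% of each covariate completely at random.
missing = rng.random((n, p)) < 0.2

# Hypothetical stand-in for multiple imputation: hot-deck draws
# from the observed values of the same column, repeated M times.
X_list = []
for _ in range(M):
    X_imp = X_full.copy()
    for j in range(p):
        observed = X_full[~missing[:, j], j]
        X_imp[missing[:, j], j] = rng.choice(observed, size=missing[:, j].sum())
    X_list.append(X_imp)
y_list = [y] * M

B = group_lasso_stacked(X_list, y_list, lam=1.0)
group_norms = np.linalg.norm(B, axis=0)
selected = [j for j in range(p) if group_norms[j] > 1e-8]
print("group norms:", np.round(group_norms, 3))
print("selected predictors:", selected)
```

Because the penalty acts on the norm of each column of `B`, a predictor is either retained in every imputed data set or removed from all of them, which gives one selection decision per covariate rather than one per imputation. In practice the penalty would be tuned, for example by cross-validation or an information criterion, and the imputations would come from a dedicated procedure.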
