Variable selection for multiply‐imputed data with application to dioxin exposure study

Multiple imputation (MI) is a commonly used technique for handling missing data in large‐scale medical and public health studies. However, variable selection on multiply‐imputed data remains an important and longstanding statistical problem. If a variable selection method is applied to each imputed dataset separately, it may select different variables for different imputed datasets, which makes it difficult to interpret the final model or draw scientific conclusions. In this paper, we propose a novel multiple imputation‐least absolute shrinkage and selection operator (MI‐LASSO) variable selection method, which extends the least absolute shrinkage and selection operator (LASSO) method to multiply‐imputed data. The MI‐LASSO method treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty, so that each variable is either selected in every imputed dataset or excluded from all of them. We use a simulation study to demonstrate the advantage of the MI‐LASSO method over the alternatives. We also apply the MI‐LASSO method to the University of Michigan Dioxin Exposure Study to identify important circumstances and exposure factors that are associated with human serum dioxin concentration in Midland, Michigan. Copyright © 2013 John Wiley & Sons, Ltd.
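
The grouping idea in the abstract can be illustrated with a minimal sketch. It assumes a linear model with squared-error loss and uses a plain proximal-gradient solver; the abstract does not specify the fitting algorithm, so the solver, the function name `mi_lasso_sketch`, the penalty level `lam`, and the simulated "imputed" datasets are all assumptions made for illustration only. The penalty is the sum over variables of the Euclidean norm of that variable's coefficients across the imputed datasets.

```python
import numpy as np

def mi_lasso_sketch(X_list, y_list, lam, n_iter=500):
    """Group-LASSO fit across imputed datasets (illustrative, hypothetical helper).

    Minimizes sum_d ||y_d - X_d b_d||^2 / (2n) + lam * sum_j ||B[:, j]||_2,
    where row d of B holds the coefficients for imputed dataset d.
    """
    D = len(X_list)
    n, p = X_list[0].shape
    B = np.zeros((D, p))
    # Step size from the largest Lipschitz constant of the per-dataset gradients.
    lipschitz = max(np.linalg.norm(X, 2) ** 2 / n for X in X_list)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        # Gradient step on the least-squares loss, one imputed dataset at a time.
        G = np.stack([X.T @ (X @ b - y) / n
                      for X, y, b in zip(X_list, y_list, B)])
        Z = B - step * G
        # Group soft-thresholding: each column (one variable, all datasets)
        # is shrunk jointly, so it is zeroed in all datasets or in none.
        norms = np.linalg.norm(Z, axis=0)
        scale = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        B = Z * scale
    return B

# Simulated stand-in for multiply-imputed data (assumption: three datasets that
# differ only by small perturbations of the covariates).
rng = np.random.default_rng(0)
n, p, D = 200, 6, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + 0.5 * rng.normal(size=n)
X_list = [X + 0.05 * rng.normal(size=(n, p)) for _ in range(D)]
y_list = [y] * D

B = mi_lasso_sketch(X_list, y_list, lam=0.5)
selected = np.flatnonzero(np.linalg.norm(B, axis=0) > 0)
print("selected variables:", selected)
```

Because the shrinkage factor is computed per column of `B`, a variable cannot be selected in one imputed dataset and dropped in another, which is the consistency property the abstract describes. The coefficient values themselves may still differ across datasets.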
