Model selection and model averaging after multiple imputation

Model selection and model averaging are two important techniques for obtaining practical and useful models in applied research. However, it is now well known that many complex issues arise, especially in the context of model selection, when the stochastic nature of the selection process is ignored and estimates, standard errors, and confidence intervals are calculated as if the selected model were known a priori. While model averaging aims to incorporate the uncertainty associated with the model selection process by combining estimates over a set of models, there is still some debate over appropriate interpretation and confidence interval construction. These problems become even more complex in the presence of missing data, and it is currently not entirely clear how to proceed. To deal with such situations, a framework for model selection and model averaging in the context of missing data is proposed. The focus lies on multiple imputation as a strategy to deal with the missingness: combining it consistently with model averaging aims to incorporate both the uncertainty associated with model selection and the uncertainty associated with the imputation process. Furthermore, the performance of bootstrapping as a flexible extension to our framework is evaluated. Monte Carlo simulations are used to investigate the properties of the proposed estimators in the context of the linear regression model. The practical implications of our approach are illustrated by means of a recent survival study on sputum culture conversion in pulmonary tuberculosis.
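The idea of combining model averaging with multiple imputation can be illustrated with a minimal sketch for the linear regression model. It rests on assumptions that the text above does not confirm: AIC-based model weights, ordinary least squares fits, and pooling of point estimates by a simple average over the imputed datasets. All function names are hypothetical, the imputed (completed) datasets are assumed to come from some separate imputation routine, and only the between-imputation variance is returned, since the within-imputation variance of a model-averaged estimator is itself a debated quantity.

```python
import numpy as np


def fit_ols(X, y, cols):
    """Fit OLS on the columns in `cols`.

    Returns a full-length coefficient vector (zeros for excluded columns)
    and the Gaussian AIC up to an additive constant.
    """
    n, p = X.shape
    cols = list(cols)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    beta = np.zeros(p)
    beta[cols] = coef
    rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
    aic = n * np.log(rss / n) + 2 * (len(cols) + 1)
    return beta, aic


def aic_weights(aics):
    """Turn AIC values into model weights that sum to one (assumed weighting)."""
    delta = np.asarray(aics, dtype=float) - np.min(aics)
    w = np.exp(-0.5 * delta)
    return w / w.sum()


def model_average(X, y, candidate_sets):
    """Weighted average of coefficient vectors over the candidate models."""
    fits = [fit_ols(X, y, cols) for cols in candidate_sets]
    betas = np.array([b for b, _ in fits])
    w = aic_weights([a for _, a in fits])
    return w @ betas


def pool_over_imputations(completed, candidate_sets):
    """Average the model-averaged estimates over M completed datasets.

    `completed` is a list of (X, y) pairs, one per imputation (M >= 2).
    Returns the pooled point estimate and the between-imputation variance.
    """
    est = np.array([model_average(X, y, candidate_sets) for X, y in completed])
    return est.mean(axis=0), est.var(axis=0, ddof=1)
```

Model averaging is applied within each imputed dataset first and the results are pooled afterwards, so both the spread across candidate models and the spread across imputations enter the final estimate. Other weight choices or a bootstrap step could replace the AIC weights without changing this overall structure.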
