Statistical primer: how to deal with missing data in scientific research?

Missing data are a common challenge in research and can compromise the results of statistical inference when not handled appropriately. This paper aims to introduce the basic concepts of missing data to a non-statistical audience, to list and compare some of the most popular approaches for handling missing data in practice, and to provide guidelines and recommendations for dealing with and reporting missing data in scientific research. Complete case analysis and single imputation are simple approaches for handling missing data and are popular in practice; however, in most cases they are not guaranteed to provide valid inferences. Multiple imputation is a robust and general alternative that is appropriate for data missing at random and overcomes the disadvantages of the simpler approaches, but it should always be conducted with care. The aforementioned approaches are illustrated and compared in an example application using Cox regression.
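To make the contrast between the three approaches concrete, the following is a minimal sketch on simulated data, not the paper's Cox regression example. Everything in it is an assumption made for illustration: the variables `x` and `y`, the logistic missingness mechanism, the choice of the mean of `x` as the target quantity, and the bootstrap-based imputation model. It estimates the mean of a partly missing variable by complete case analysis, by single (mean) imputation, and by multiple imputation pooled with Rubin's rules, under a missing-at-random mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): x is partly missing, y is fully observed.
# The true mean of x is 0.
n = 2000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Missing at random: the chance that x is missing depends only on observed y.
p_missing = 1.0 / (1.0 + np.exp(-y))
missing = rng.random(n) < p_missing
observed = ~missing


def mean_and_se(values):
    """Return the sample mean and its standard error."""
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))


# 1. Complete case analysis: use only the rows where x is observed.
cc_est, cc_se = mean_and_se(x[observed])

# 2. Single (mean) imputation: fill every gap with the observed mean.
x_single = np.where(missing, x[observed].mean(), x)
si_est, si_se = mean_and_se(x_single)

# 3. Multiple imputation: m stochastic regression imputations of x given y.
m = 20
obs_idx = np.flatnonzero(observed)
estimates, variances = [], []
for _ in range(m):
    # Bootstrap the complete cases so that uncertainty in the imputation
    # model's parameters is carried into the imputed values.
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(y[boot], x[boot], deg=1)
    resid_sd = np.std(x[boot] - (intercept + slope * y[boot]), ddof=2)

    x_imp = x.copy()
    x_imp[missing] = (intercept + slope * y[missing]
                      + rng.normal(scale=resid_sd, size=missing.sum()))

    est, se = mean_and_se(x_imp)
    estimates.append(est)
    variances.append(se ** 2)

# Rubin's rules: pool the m estimates and combine the within-imputation
# and between-imputation variances.
mi_est = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
mi_se = np.sqrt(within + (1 + 1 / m) * between)

print(f"complete case:       {cc_est:.3f} (SE {cc_se:.3f})")
print(f"mean imputation:     {si_est:.3f} (SE {si_se:.3f})")
print(f"multiple imputation: {mi_est:.3f} (SE {mi_se:.3f})")
```

In this setup the complete case estimate is biased because the observed rows are not a random subset. Mean imputation reproduces the same biased estimate while reporting a smaller standard error, since the filled-in values add no variability. The multiply imputed estimate should land much closer to the true mean of 0, with a standard error that reflects the uncertainty about the imputed values.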
