Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment

Multiple imputation under the assumption of multivariate normality has in recent years become a frequently used model-based approach for handling incomplete continuous data. Despite its simplicity and popularity, however, its plausibility has not been thoroughly evaluated via simulation. In this work, the performance of multiple imputation under a multivariate Gaussian model with unstructured covariances was examined on a broad range of simulated incomplete data sets exhibiting distributional characteristics, such as skewness and multimodality, that a Gaussian model does not accommodate. The behavior of efficiency and accuracy measures was explored to determine the extent to which the procedure works properly. The conclusion drawn is that, although real data rarely conform to multivariate normality, imputation under the assumption of normality is a fairly reasonable tool even when that assumption is clearly violated and the fraction of missing information is high, especially when the sample size is relatively large. Although we discourage its uncritical, automatic and possibly inappropriate use, we report that its performance is better than we expected, leading us to believe that it is probably an underrated approach.
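The kind of experiment described above can be illustrated with a minimal sketch. It is not the authors' simulation design: it assumes a single right-skewed (lognormal) outcome y, a fully observed covariate x, and values missing completely at random. With x fully observed, imputing under a bivariate normal model reduces to a normal linear regression of y on x, which is what the sketch uses. The function names, sample size, missingness rate and number of imputations are all hypothetical, and the fraction of missing information is the simple large-m approximation.

```python
import numpy as np

def impute_normal_regression(x, y, rng):
    """Return one imputed copy of y, drawing parameters from their posterior."""
    obs = ~np.isnan(y)
    X = np.column_stack([np.ones_like(x), x])
    Xo, yo = X[obs], y[obs]
    n, p = Xo.shape
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    # Residual variance: scaled inverse chi-square draw (noninformative prior).
    sigma2 = resid @ resid / rng.chisquare(n - p)
    # Regression coefficients: normal draw given the sampled variance.
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    y_imp = y.copy()
    n_mis = int((~obs).sum())
    y_imp[~obs] = X[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), n_mis)
    return y_imp

def rubin_combine(estimates, variances):
    """Combine m point estimates and their variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                 # pooled point estimate
    w = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b      # total variance
    fmi = (1.0 + 1.0 / m) * b / t    # approximate fraction of missing information
    return q_bar, t, fmi

def simulate_once(n=500, miss_rate=0.3, m=10, seed=0):
    """One replication: skewed data, MCAR missingness, normal-model imputation."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = np.exp(0.5 * x + rng.normal(scale=0.5, size=n))  # lognormal outcome
    y_mis = y.copy()
    y_mis[rng.random(n) < miss_rate] = np.nan            # delete values at random
    ests, variances = [], []
    for _ in range(m):
        y_imp = impute_normal_regression(x, y_mis, rng)
        ests.append(y_imp.mean())
        variances.append(y_imp.var(ddof=1) / n)
    return rubin_combine(ests, variances), y.mean()

if __name__ == "__main__":
    (q_bar, total_var, fmi), full_mean = simulate_once()
    print(f"pooled mean {q_bar:.3f}, complete-data mean {full_mean:.3f}, FMI {fmi:.2f}")
```

Repeating `simulate_once` over many seeds, sample sizes and missingness rates, and recording bias and interval coverage for the pooled estimate, would give accuracy and efficiency summaries of the general kind the abstract refers to.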
