EXPERIMENTS WITH VARIANCE ESTIMATION FROM SURVEY DATA WITH IMPUTED VALUES

1. SUMMARY

In this report, we describe the methodology and results of a Monte Carlo study of variance estimators intended for six different imputation methods. The imputation methods considered in the study were:

1). Single imputation by regression (REG)
2). Single imputation by regression with added residual (REGRES)
3). Single imputation by regression with added standardized residual (REGRESST)
4). Single imputation by nearest neighbor (NN)
5). Multiple imputation by regression with added residual (MULTREG)
6). Multiple imputation by nearest neighbor (MULTNN)

We used M = 2 repetitions for the multiple imputation methods 5 and 6. For each imputation method, we evaluated one or more variance estimators. A total of 10 variance estimators were included in the study.

The simulations were carried out with 12 different populations representing a variety of relationships between x (the auxiliary variable used in the imputation) and y (the study variable). For each population, three different response mechanisms were used, leading to a total of 12 x 3 = 36 different cases.

The objective was to identify variance estimators that perform reasonably well under a variety of conditions. Ideal performance under all possible circumstances seems impossible to attain. Some of the main conclusions are:

1). Concerning the point estimators corresponding to the six imputation methods: All imputation methods have a tolerable bias if the nonresponse is ignorable (that is, when the nonresponse occurs at random for given x; the precise definition is given in Rubin (1976)). However, all of the methods lead to a fairly substantial bias when the nonresponse is non-ignorable (that is, when the probability of nonresponse is systematically related to the variable of interest). Nearest neighbor imputation tends to produce a greater bias than regression imputation.

2). Concerning the variance estimators: None of the 10 variance estimators included in our study comes close to yielding unbiased estimates in all 36 cases. However, a few of them show an overall performance that can be termed acceptable. Their bias is fairly limited in all or most of the 36 cases, and they typically alternate between mild overestimation and mild underestimation. These methods, defined in detail in Section 2, are: REGRES-SARN, REG-RAO1 and REG-RAO2 for single regression imputation; NN-SARN for single nearest neighbor imputation; and the multiple imputation variance estimators MULTREG for multiple regression imputation and MULTNN for multiple nearest neighbor imputation.

Some of the variance estimators we examined may work very well under the particular conditions for which they were designed. For instance, the methods REG-RAO1 and REG-RAO2 (for single regression imputation) perform very well when the nonresponse is ignorable.

The multiple imputation variance estimators are more variable than the other alternatives; consequently, the confidence intervals calculated with these methods have a less predictable length. This disadvantage comes in addition to the heavy computation required by two or more imputations.

Our study shows the difficulty of identifying variance estimators that behave impeccably under a variety of conditions. It also emphasizes that variance estimators based on "standard formulas" must not be used. The standard estimators rest on the usually invalid assumption that imputed values have the same quality as observed values, and they lead to a considerable underestimation of the variance.
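
The warning about "standard formulas" can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the design used in this study: the population model (y linear in x plus noise), the population and sample sizes, the uniform response rate, and the function name simulate are all hypothetical choices made for illustration. It applies single regression imputation to simple random samples with ignorable nonresponse, then compares the average of the standard variance estimator (which treats imputed values as if they were observed) with the Monte Carlo variance of the imputed-data mean.

```python
import numpy as np


def simulate(n_pop=2000, n=200, resp_rate=0.6, reps=1000, seed=0):
    """Compare the standard (naive) variance estimator with the Monte Carlo
    variance of the sample mean after single regression imputation.

    All settings are illustrative assumptions, not those of the report.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical finite population: y depends linearly on x, plus noise.
    x = rng.gamma(shape=2.0, scale=5.0, size=n_pop)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 10.0, size=n_pop)

    means = np.empty(reps)
    naive_vars = np.empty(reps)
    for r in range(reps):
        # Simple random sample without replacement.
        idx = rng.choice(n_pop, size=n, replace=False)
        xs, ys = x[idx], y[idx]

        # Ignorable nonresponse: every unit responds with the same probability.
        responded = rng.random(n) < resp_rate

        # Fit y = a + b*x on respondents (polyfit returns slope first).
        b, a = np.polyfit(xs[responded], ys[responded], 1)

        # Single regression imputation: fill in fitted values, no residual added.
        y_imp = np.where(responded, ys, a + b * xs)

        means[r] = y_imp.mean()
        # Standard formula: treats all n values as if they had been observed.
        naive_vars[r] = (1.0 - n / n_pop) * y_imp.var(ddof=1) / n

    return {
        "true_mean": y.mean(),
        "mc_mean": means.mean(),
        "mc_variance": means.var(ddof=1),
        "mean_naive_variance": naive_vars.mean(),
    }
```

Calling simulate() returns a dictionary; under these assumed settings, the entry "mean_naive_variance" should fall clearly below "mc_variance", which is the underestimation described above, while "mc_mean" stays close to "true_mean" because the nonresponse is ignorable.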