ESTIMATION OF THE VARIANCE IN THE PRESENCE OF NEAREST NEIGHBOUR IMPUTATION

The nearest neighbour (NN) imputation method is used to supply substitutes for missing data in many surveys conducted at Statistics Canada. This trend will continue, since the availability of software such as the Generalized Edit and Imputation System (GEIS) provides a relatively simple means of performing nearest neighbour imputation. Since an NN imputed value comes from a donor (one of the respondents), it is an actually occurring value, not a constructed value as in regression imputation. An NN imputed value may not be a perfect substitute, but it is unlikely to be a nonsensical value. Normally, NN imputation yields point estimates with small or negligible bias, assuming that a linear relationship exists between the variable of interest y and the concomitant variable x used for nearest neighbour identification.

When the survey estimate is calculated in part from imputed values, it is not a trivial matter to produce a valid estimate of its variance. It is well known that the standard complete data variance estimator severely underestimates the true variance when applied to data with imputed values. In recent years, considerable attention has been given to this problem when single value imputation is used; see, for example, Särndal (1990), Rao and Shao (1992), Rao and Sitter (1992), Kovar and Chen (1994), and Lee, Rancourt and Särndal (1994). These attempts were very successful for regression and mean imputation, but for NN imputation the suggested solutions have been ad hoc. In this paper, we provide a more satisfactory solution to the variance estimation problem for NN imputation.

There are basically three approaches to variance estimation in the presence of imputation. The oldest and probably best known method is multiple imputation (Rubin, 1977, 1987). Another is the model-assisted approach (Särndal, 1990), and the third is based on the jackknife technique (Rao, 1992). All three approaches have been tried for NN imputation by different authors, with moderate success.
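As a rough illustration of the donor mechanism described above, the following sketch fills each missing y with the y value of the respondent whose x is closest. The function name nn_impute, the use of absolute distance on a single concomitant variable, and the tie-breaking rule (the first closest respondent is taken as donor) are assumptions made for this example only; it is not the procedure implemented in any production system.

```python
def nn_impute(x, y):
    """Return a completed copy of y, where each missing value (None) is
    replaced by the y value of the respondent with the closest x.

    Assumptions for illustration: one concomitant variable, absolute
    distance, and ties resolved in favour of the first closest respondent.
    """
    # Respondents (units with an observed y) form the donor pool.
    donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    if not donors:
        raise ValueError("no respondents available to act as donors")

    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            # min() returns the first minimal element, so ties go to the
            # earliest respondent in the file.
            _, yi = min(donors, key=lambda d: abs(d[0] - xi))
        completed.append(yi)
    return completed


# Hypothetical data: units 2 and 4 are nonrespondents for y.
x = [10.0, 12.0, 20.0, 21.0, 35.0]
y = [5.1, None, 9.8, None, 17.2]
completed = nn_impute(x, y)
print(completed)
```

Every imputed value in the completed data is a y value actually observed for some respondent, which is the property that distinguishes NN imputation from regression imputation.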
With multiple imputation, it is difficult to define a "proper multiple imputation" for NN imputation, and thus the variance is underestimated (see Lee, Rancourt and Särndal, 1994). These authors also tried the model-assisted approach, treating the formulae for ratio imputation as if they were applicable to NN imputation as well. This worked better than multiple imputation, but the negative bias was still present and nonnegligible (see Lee et al., 1994). The jackknife technique has been used with some success for variance estimation when the data contain imputations. However, to produce the input for the jackknife formula (the estimate recalculated after deletion of one observation), the imputed values must first be adjusted. The appropriate adjustment depends on the particular imputation method used. A difficulty with the jackknife for NN imputation has been that no entirely satisfactory adjustment has yet been found. Kovar and Chen (1994) examined the jackknife technique for NN imputation using a less than ideal adjustment, namely, the adjustment appropriate for ratio imputation. This method substantially reduced the bias of the standard complete data variance estimator but could not eliminate it.

In this paper we develop an improved variance estimation technique for NN imputation. The method is model-assisted and gives correct variance estimates when the variable of interest y and the concomitant variable x are related by a linear regression through the origin. We obtain simple explicit estimators for the two components of the variance, that is, the sampling variance and the imputation variance. The theoretical results are presented in Section 2. In Section 3 we report the results of a Monte Carlo experiment which