Filling in the Blanks: Some Guesses Are Better Than Others
Illustrating the impact of covariate selection when imputing complex survey items

Imputation is the statistical process of filling in missing values with educated guesses to produce a complete data set. One objective of imputation is to preserve multivariate structure. How does the impact of common naïve imputation approaches compare with that of a more sophisticated approach?

Fully imputing responses to a survey questionnaire in preparation for data publication can be a major undertaking. Common challenges include complex skip patterns, complex patterns of missingness, a large number of variables, a variety of variable types (e.g., normal, transformable to normal, other continuous, count, Likert, other discrete ordered, Bernoulli, and multinomial), and both time and budget constraints.

Faced with such challenges, data publishers commonly simplify imputation by focusing on the preservation of a small number of multivariate structural features. For instance, a hot deck imputation scheme randomly selects respondents as donors for missing cases; similarly, a hot deck within cells procedure randomly selects donors from the same cell, where cells are defined by a few categorical variables. To simplify the hot deck procedure further, a separate hot deck with cells defined by a small common set of variables (e.g., age, race, and sex) might be used for each variable targeted for imputation. Another example, in the context of a longitudinal survey, is simply to carry forward the last reported value for each target variable.

Although such procedures are inexpensive and adequately preserve some important multivariate structural features, they may blur many others. Such blurring diminishes the value of the published data for researchers interested in a different set of structural features than those preserved by the data publisher’s imputation process.

We have been working on imputation algorithms that preserve a larger number of multivariate structural features.
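To make the simpler procedures described above concrete, the following is a minimal Python sketch of a hot deck within cells and of carrying forward the last reported value. It is an illustration only, not the procedure used by any particular survey: the function names, the use of `None` to mark a missing value, and the toy records (with made-up `sex`, `age_group`, and `income` fields) are all assumptions introduced for this example.

```python
import random
from collections import defaultdict


def hot_deck_within_cells(records, target, cell_vars, rng):
    """Fill missing `target` values (None) with a value drawn at random
    from a respondent (donor) in the same cell.

    Hypothetical helper: cells are defined by the variables in `cell_vars`.
    """
    # Pool the observed (donor) values by cell.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in cell_vars)].append(rec[target])

    completed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec[target] is None:
            pool = donors.get(tuple(rec[v] for v in cell_vars))
            if pool:  # leave the value missing if the cell has no donors
                rec[target] = rng.choice(pool)
        completed.append(rec)
    return completed


def carry_forward(waves):
    """Replace each missing value (None) in a time-ordered list of reports
    with the last reported value; leading gaps stay missing."""
    filled, last = [], None
    for value in waves:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Toy data (invented for illustration): one donor per cell.
records = [
    {"sex": "F", "age_group": "18-20", "income": 30},
    {"sex": "F", "age_group": "18-20", "income": None},
    {"sex": "M", "age_group": "18-20", "income": 50},
    {"sex": "M", "age_group": "18-20", "income": None},
]
completed = hot_deck_within_cells(
    records, "income", ["sex", "age_group"], random.Random(0)
)
incomes = [r["income"] for r in completed]

locf = carry_forward([None, 3, None, None, 5, None])
```

In this sketch a donor is matched on the cell variables alone, so any relationship between the imputed variable and a variable outside the cell definition is ignored. That is the kind of blurring of multivariate structure discussed above.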
Our algorithms allow some advance targeting of features to be preserved, but also try to discover and preserve strong unanticipated features in the hope of better serving secondary data analysts. The discovery process is designed to work without human intervention and with only minimal human guidance.

In this article, we illustrate the effect of our imputation algorithm compared to simpler algorithms. To do so, we use data from the National Education Longitudinal Study of 1988 (NELS), a longitudinal study of students conducted for the U.S. Department of Education’s National Center for Education Statistics. The NELS provides data about the experiences of a cohort of students who were in 8th grade in 1988 as they progress through middle and high school and enter post-secondary institutions or the work force. The 1988 baseline survey was followed up at two-year intervals, from 1990 through 1994. In addition to student responses, the survey collected data from parents, teachers, and principals. We use parent data (family income and religious affiliation) from the second follow-up (1992) and student data (e.g., sexual behavior and expected educational attainment) from the third follow-up (1994), by which time the modal student age was 20 years. This results in