What do we do with missing data? Some options for analysis of incomplete data.

Missing data are a pervasive problem in many public health investigations. The standard approach is to restrict the analysis to subjects with complete data on the variables involved. Estimates from such an analysis can be biased, especially if the included subjects differ systematically from the excluded subjects on one or more key variables. The severity of this bias is illustrated through a simulation study in a logistic regression setting. This article reviews three approaches for analyzing incomplete data. The first approach weights the subjects who are included in the analysis to compensate for those who were excluded because of missing values. The second approach is multiple imputation, in which each missing value is replaced by two or more plausible values. The third approach constructs the likelihood from the incomplete observed data. The same logistic regression example is used to illustrate the basic concepts and methodology of each approach. Some software packages for analyzing incomplete data are also described.
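The abstract does not spell out how results from the multiply imputed data sets are combined. As a minimal sketch, assuming the standard combining rules (often called Rubin's rules), the pooled estimate is the average of the completed-data estimates, and its variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m). The function name `pool_rubin` and the numbers below are hypothetical and chosen only for illustration; they are not taken from the article.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their variances (Rubin's rules).

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical log-odds-ratio estimates from m = 5 imputed data sets.
estimates = [0.50, 0.60, 0.40, 0.55, 0.45]
variances = [0.040, 0.050, 0.045, 0.040, 0.050]

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.4f}, total variance = {total_var:.4f}")
```

The between-imputation term is what separates multiple imputation from filling in a single value: it carries the uncertainty about the missing values into the final standard error.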
