A nonparametric multiple imputation approach for missing categorical data

Background: Incomplete categorical variables with more than two categories are common in public health data. However, most existing missing-data methods do not use the information in nonresponse (missingness) probabilities.

Methods: We propose a nearest-neighbour multiple imputation approach to impute a missing-at-random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring the distance between each missing value and the non-missing values. The distance function is based on a predictive score derived from two working models: one fits a multinomial logistic regression to predict the missing categorical outcome (the outcome model), and the other fits a logistic regression to predict missingness probabilities (the missingness model). A weighting scheme accommodates the contributions of the two working models when generating the predictive score. A missing value is imputed by randomly selecting one value from the donor set, that is, from the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented.

Results: The simulation study suggests that, when missingness probabilities are not extreme, the proposed method performs well under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions of the two working models and achieve optimal results for the proposed method.

Conclusions: We conclude that the proposed multiple imputation method is a reasonable approach to handling missing categorical outcome data with more than two levels when assessing the distribution of the outcome. For the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
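The imputation step described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the predictive scores from the two working models have already been computed and are passed in as arrays. The function name, the weighted squared Euclidean distance on standardized scores, and the defaults for the weight, the donor-set size, and the number of imputations are all assumptions made for illustration.

```python
import numpy as np

def nn_multiple_imputation(y, outcome_scores, missing_scores, w_outcome=0.8,
                           n_neighbors=5, n_imputations=10, seed=0):
    """Hypothetical sketch of nearest-neighbour multiple imputation.

    y              : 1-D array of category codes, np.nan where missing.
    outcome_scores : (n,) or (n, p) predictive scores from the outcome model.
    missing_scores : (n,) predictive scores from the missingness model.
    w_outcome      : weight on the outcome model; 1 - w_outcome goes to the
                     missingness model (assumed weighting scheme).
    Returns the categories and their proportions averaged over imputations.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    observed = ~np.isnan(y)
    categories = np.unique(y[observed])

    def standardize(s):
        # Put both scores on a common scale before combining them.
        s = np.asarray(s, dtype=float).reshape(n, -1)
        sd = s.std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant scores
        return (s - s.mean(axis=0)) / sd

    s_out = standardize(outcome_scores)
    s_mis = standardize(missing_scores)
    y_obs = y[observed]
    k = min(n_neighbors, int(observed.sum()))

    proportions = np.zeros((n_imputations, len(categories)))
    for m in range(n_imputations):
        completed = y.copy()
        for i in np.flatnonzero(~observed):
            # Weighted squared distance from case i to every observed case.
            d2 = (w_outcome * ((s_out[observed] - s_out[i]) ** 2).sum(axis=1)
                  + (1.0 - w_outcome) * ((s_mis[observed] - s_mis[i]) ** 2).sum(axis=1))
            donors = np.argsort(d2)[:k]        # donor set: k smallest distances
            draw = rng.choice(donors)          # pick one donor at random
            completed[i] = y_obs[draw]
        proportions[m] = [(completed == c).mean() for c in categories]

    # Point estimate: average the category proportions across imputed data sets.
    return categories, proportions.mean(axis=0)
```

Setting `w_outcome` to 1 or 0 makes the distance depend on only one working model, which is one way to see why the choice of weights matters for the balance the abstract describes.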
