Effect of Missing Data Imputation on Prediction of Urinary Incontinence

Missing data imputation is an essential preprocessing step when mining clinical survey data. Rough set imputation is one way to handle missing data. A major advantage of rough sets is that the information in the dataset itself is sufficient for the analysis: no additional information, external parameters, models, functions, grades, or subjective interpretations are needed. Although several studies have examined rough set data imputation, none has measured the effect of such imputation on prediction. In this paper, we generate several simulation datasets from an existing epidemiological dataset (MESA) to perform such a study. To measure how well each dataset fits the prediction model, we use p-values from the Wald test. To evaluate the precision of the prediction, we use the width of the 95% confidence interval for the probability of incontinence. Both the imputed and non-imputed simulation datasets were fitted to the prediction model, and both fits were significant (p-value < 0.05). In addition, the Wald score indicates a better fit for the imputed datasets than for the non-imputed ones. The average confidence interval width decreased by 10.4% when the imputed dataset was used, i.e., higher precision was achieved. These results show that rough set imputation of missing data improves predictive capability on the MESA data. Further studies are required to generalize this conclusion to other clinical survey datasets.
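
The two evaluation measures named above, the Wald test p-value and the width of the 95% confidence interval for a predicted probability, can be illustrated with a minimal Python sketch. It assumes a logistic-type prediction model in which the interval is built on the logit scale and then transformed back to a probability; the abstract does not specify the model, so this is an assumption, not the study's procedure. The function names and all numeric inputs are hypothetical and are not values from the MESA analysis.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for the Wald test of H0: coefficient == 0."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def probability_ci_width(linear_predictor, std_error, z_crit=1.959964):
    """Width of the 95% CI for a predicted probability.

    Assumption: the interval is computed on the logit scale as
    linear_predictor +/- z_crit * std_error and mapped back through
    the inverse logit.
    """
    def inverse_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    lower = inverse_logit(linear_predictor - z_crit * std_error)
    upper = inverse_logit(linear_predictor + z_crit * std_error)
    return upper - lower


def relative_width_reduction(width_before, width_after):
    """Fractional decrease in CI width (positive means a narrower interval)."""
    return (width_before - width_after) / width_before


# Hypothetical inputs for illustration only.
p = wald_p_value(estimate=0.8, std_error=0.3)
width_non_imputed = probability_ci_width(linear_predictor=-0.5, std_error=0.40)
width_imputed = probability_ci_width(linear_predictor=-0.5, std_error=0.35)
reduction = relative_width_reduction(width_non_imputed, width_imputed)
print(p, width_non_imputed, width_imputed, reduction)
```

A smaller standard error for the linear predictor yields a narrower interval, which is the sense in which a reduced average width corresponds to higher precision.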