Comparative Study of Various Methods of Handling Missing Data

The scientific literature lacks a straightforward answer as to which of the existing methods of missing data imputation is most suitable in terms of simplicity, accuracy and ease of use. This paper explores various methods of data imputation and then proposes a robust one. It uses simulated data sets generated from various distributions. A regression function is fitted to each simulated data set and the residual standard error of the fitted function is obtained. Data are then randomly removed from the set of independent variables to create artificial non-response, and suitable methods are used to impute the missing data. The methods considered are mean, regression, hot and cold deck, multiple and median imputation, listwise deletion, the EM algorithm and the nearest neighbour method. The paper investigates the three most common traditional methods of handling missing data to establish the optimal one. Suitability is determined by how little the characteristics of the imputed data sample vary from those of the original data set before imputation; this variation is measured using the regression intercept and the residual standard error. The R statistical package is used for most of the regressions. Microsoft Excel is used to determine the correlation between columns in the hot deck method, because it is readily available as a component of the Microsoft Office suite. The results of the data analysis show that median imputation gives intercept and R-squared values that closely mirror those of the original data sets, suggesting that it is the better imputation method among the conventional methods. This finding is important from a research point of view, given the many cases of missing data in scientific research. Finding and using the median is simple, so most researchers have a ready tool at hand for handling missing data.
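The comparison procedure described above can be illustrated with a minimal sketch. It is not the authors' code (the paper reports using R and Microsoft Excel); it is a Python illustration under stated assumptions: a single exponentially distributed predictor, a linear model with arbitrarily chosen coefficients, 20% of predictor values removed completely at random, and only mean and median imputation compared. The helper names `fit_ols` and `impute` are hypothetical.

```python
import numpy as np

def fit_ols(x, y):
    """Fit y ~ x by least squares; return (intercept, slope, residual standard error)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Residual standard error with n - 2 degrees of freedom (two fitted coefficients).
    rse = np.sqrt(resid @ resid / (len(y) - 2))
    return beta[0], beta[1], rse

def impute(x, how):
    """Replace NaN entries of x with the mean or the median of the observed values."""
    fill = np.nanmean(x) if how == "mean" else np.nanmedian(x)
    return np.where(np.isnan(x), fill, x)

rng = np.random.default_rng(0)
n = 500

# Simulated data set: the distribution and coefficients are assumptions for illustration.
x = rng.exponential(scale=2.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Artificial non-response: remove 20% of the predictor values at random.
x_missing = x.copy()
missing_idx = rng.choice(n, size=n // 5, replace=False)
x_missing[missing_idx] = np.nan

# Fit the same regression on the original data and on each imputed data set.
results = {"original": fit_ols(x, y)}
for how in ("mean", "median"):
    results[how] = fit_ols(impute(x_missing, how), y)

for name, (b0, b1, rse) in results.items():
    print(f"{name:>8}: intercept={b0:.3f} slope={b1:.3f} rse={rse:.3f}")
```

A method is judged more suitable when its intercept and residual standard error stay closer to those of the original fit. A single run of this sketch is only illustrative; a study such as the one described would repeat it across several distributions and missingness rates.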
