Handling missing values: A study of popular imputation packages in R

Abstract Real-world data are often plagued by missing values, which adversely affect the outcome of any analysis based on such data. Missing values can be handled using techniques such as deletion or imputation. Of late, R has become one of the most preferred platforms for data analysis, and its popularity continues to grow. R provides various packages for handling missing values through imputation. The presence of multiple packages, however, calls for an analysis of their comparative performance and an examination of their suitability for a given dataset. The performance of R packages may differ across datasets and may depend on the size of the dataset and the extent of missing values in it. In this paper, the authors perform a comparative study of the common R packages used for missing value imputation, namely VIM, MICE, MissForest, and HMISC. The authors measured the performance of these packages in terms of imputation time, imputation efficiency, and effect on variance. Imputation efficiency was measured as the difference in predictive performance between a model built on the original dataset and a model built on a dataset with imputed values. Similarly, the variance of each variable in the original dataset was compared with that of the corresponding variable in the imputed dataset. An imputation package can be considered better if it consumes less imputation time and provides higher imputation accuracy. In terms of variance, one would like the package to preserve the original variance of the variables. On analysing the four imputation packages on two datasets with three predictive algorithms (Logistic Regression, Support Vector Machines, and Artificial Neural Networks), it was observed that performance varies with the size of the dataset and the missing values present in it. The study highlights that a particular imputation package used in conjunction with a particular predictive algorithm provides better performance, which is in turn a function of the dataset characteristics.
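Two of the three measures named in the abstract, imputation time and the effect on variance, can be illustrated with a minimal sketch. It is not the authors' code and does not call any of the R packages studied. It is written in Python with the standard library only, uses simple mean imputation as a stand-in imputer, and runs on synthetic data. The function names (`mask_values`, `mean_impute`, `evaluate_imputer`), the 20% missing rate, and the data are all assumptions made for illustration. The third measure, the change in predictive performance, is omitted.

```python
import random
import statistics
import time


def mask_values(column, missing_rate, seed=0):
    """Return a copy of column with roughly missing_rate of entries set to None."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else x for x in column]


def mean_impute(column):
    """Stand-in imputer (assumption): fill each None with the observed mean."""
    observed = [x for x in column if x is not None]
    fill = statistics.fmean(observed)
    return [fill if x is None else x for x in column]


def evaluate_imputer(imputer, original, missing_rate=0.2, seed=0):
    """Time one imputation run and report variance before and after."""
    masked = mask_values(original, missing_rate, seed)
    start = time.perf_counter()
    imputed = imputer(masked)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "n_missing": sum(x is None for x in masked),
        "original_variance": statistics.variance(original),
        "imputed_variance": statistics.variance(imputed),
    }


# Synthetic single-variable dataset (hypothetical values, for illustration only).
data_rng = random.Random(42)
original = [data_rng.gauss(50.0, 10.0) for _ in range(1000)]

result = evaluate_imputer(mean_impute, original)
print(result)
```

Mean imputation is chosen here only because it makes the variance criterion visible: filling every gap with one constant can never increase the sample variance relative to the complete data, so the imputed variance comes out no larger than the original. Any other imputer with the same list-in, list-out shape could be passed to `evaluate_imputer` in its place.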
