Impact of Dimension and Sample Size on the Performance of Imputation Methods

Real-world data collections often contain missing values, which can cause serious problems for data analysis. Simply discarding records with missing values tends to bias the analysis. Missing data imputation methods fill in the missing values with estimated values. While numerous imputation methods have been proposed, they are mostly judged by their imputation accuracy, and little attention has been paid to their efficiency. As data collections grow, imputation efficiency becomes an important issue. In this work we conduct an experimental comparison of several popular imputation methods, focusing on their time efficiency and scalability with respect to sample size and record dimension (number of attributes). We believe these results can guide data analysts in choosing imputation methods.
