A systematic review of machine learning-based missing value imputation techniques

Purpose
This study reviews novel data imputation approaches, particularly those from the machine learning (ML) area, along several dimensions: the type of method, the experimentation setup and the evaluation metrics used. The review shows how well the proposed frameworks are evaluated and what types and ratios of missingness they address. The review questions are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, the characteristics of the data sets and the missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods?

Design/methodology/approach
The review followed the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned 2,883 papers. Most of the papers at this stage did not present an MVI technique relevant to this study. The papers were first screened by title for relevance, and 306 were identified as appropriate. After a review of the abstracts, 151 papers that were not eligible for this study were dropped, leaving 155 research papers suitable for full-text review. Of these, 117 papers were used to assess the review questions.

Findings
This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available data set repositories. A common approach is to treat the complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and mechanism pertain to the capability of the imputation method. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.

Originality/value
The review indicates that there is no single universal solution to the missing data problem. Variants of ML approaches work well with the missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of formulating and implementing the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms are simple and easy to implement, which makes them popular across various domains.
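
The evaluation protocol summarized under Findings (a complete data set as baseline, artificially induced missingness, then a comparison of imputed and true values) can be sketched as follows. This is a minimal illustration, not the procedure of any reviewed study: the synthetic data, the 20% missingness ratio, the missing-completely-at-random mechanism, k = 3 and every function name are assumptions made for the example, and features are left unscaled for brevity.

```python
import math
import random


def induce_mcar(complete, ratio, rng):
    """Hide roughly `ratio` of all cells completely at random (MCAR)."""
    cells = [(i, j) for i in range(len(complete)) for j in range(len(complete[0]))]
    hidden = rng.sample(cells, round(ratio * len(cells)))
    masked = [row[:] for row in complete]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def distance(a, b):
    """Root mean squared difference over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def column_mean(masked, j):
    """Mean of the observed values in column j (0.0 if none are observed)."""
    observed = [row[j] for row in masked if row[j] is not None]
    return sum(observed) / len(observed) if observed else 0.0


def mean_impute(masked):
    """Baseline imputer: replace each missing cell with its column mean."""
    means = [column_mean(masked, j) for j in range(len(masked[0]))]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in masked]


def knn_impute(masked, k=3):
    """Fill each missing cell with the mean of that feature over the k nearest rows."""
    imputed = [row[:] for row in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                ((distance(row, other), other[j])
                 for m, other in enumerate(masked)
                 if m != i and other[j] is not None),
                key=lambda d: d[0],
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            # Fall back to the column mean when no comparable donor row exists.
            imputed[i][j] = sum(nearest) / len(nearest) if nearest else column_mean(masked, j)
    return imputed


def rmse(complete, imputed, hidden):
    """Root mean square error over the artificially hidden cells only."""
    return math.sqrt(sum((complete[i][j] - imputed[i][j]) ** 2 for i, j in hidden) / len(hidden))


def pcp(truth, predicted):
    """Percentage of correct prediction, for categorical values."""
    hits = sum(t == p for t, p in zip(truth, predicted))
    return 100.0 * hits / len(truth)


# Hypothetical complete data set: four correlated numeric features.
rng = random.Random(0)
complete = []
for _ in range(60):
    x = rng.uniform(0.0, 10.0)
    complete.append([x, 2 * x + rng.gauss(0, 0.5), x - 3 + rng.gauss(0, 0.5), 0.5 * x + rng.gauss(0, 0.5)])

masked, hidden = induce_mcar(complete, ratio=0.2, rng=rng)
print("kNN RMSE: ", rmse(complete, knn_impute(masked), hidden))
print("mean RMSE:", rmse(complete, mean_impute(masked), hidden))
print("PCP:", pcp(["a", "b", "c"], ["a", "b", "d"]))
```

The error is computed only over the hidden cells, since those are the only positions where the imputer's output can differ from the baseline. Varying `ratio`, the masking mechanism or the data set size in such a setup corresponds to the experimental dimensions the review compares across studies.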
