An Experimental Survey of Missing Data Imputation Algorithms

Owing to the ubiquity of missing data, data imputation has received extensive attention over the past decades. Missing data is a well-recognized problem that affects almost all fields of scientific study. Existing imputation algorithms differ in problem settings, model selection, and data evaluation, yet a systematic comparative study of them is lacking. In this paper, we survey this evolving research topic by broadly reviewing and experimentally comparing state-of-the-art missing data imputation algorithms. We analyze and categorize 19 imputation algorithms. We conduct extensive experiments over 15 real-world benchmark datasets under various settings of data types, missing mechanisms, missing rates, and dataset/model parameters, as well as on the post-imputation prediction task. From these results we draw a series of constructive insights for applying imputation algorithms to the imputation problem in real-life scenarios. Moreover, we put forward promising future directions for the data imputation problem.
