DataWig: Missing Value Imputation for Tables

With the growing importance of machine learning (ML) algorithms for practical applications, reducing data quality problems in ML pipelines has become a major focus of research. In many cases, missing values can break data pipelines, which makes completeness one of the most impactful data quality challenges. Current missing value imputation methods focus on numerical or categorical data and can be difficult to scale to datasets with millions of rows. We release DataWig, a robust and scalable approach for missing value imputation that can be applied to tables with heterogeneous data types, including unstructured text. DataWig combines deep learning feature extractors with automatic hyperparameter tuning. This enables users without a machine learning background, such as data engineers, to impute missing values with minimal effort in tables with more heterogeneous data types than existing libraries support, while requiring less glue code for feature engineering and offering more flexible modelling options. We demonstrate that DataWig compares favourably to existing imputation packages. Source code, documentation, and unit tests for this package are available at github.com/awslabs/datawig.
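
To make the idea of imputing a missing categorical value from an unstructured text column concrete, the following is a minimal sketch in plain Python. It is not DataWig's implementation or API: it swaps the deep learning feature extractors and hyperparameter tuning described above for hashed character n-gram counts and a nearest-centroid rule, and every name in it (`ngram_features`, `fit_centroids`, `impute`, the `title` and `category` columns, the toy rows) is a hypothetical chosen for illustration.

```python
import zlib
from collections import Counter, defaultdict


def ngram_features(text, n=3, buckets=2 ** 16):
    """Hash the character n-grams of `text` into a sparse count vector."""
    padded = f" {text.lower()} "
    counts = Counter()
    for i in range(len(padded) - n + 1):
        # crc32 is used instead of hash() so results are stable across runs.
        counts[zlib.crc32(padded[i:i + n].encode("utf-8")) % buckets] += 1
    return counts


def fit_centroids(rows, input_column, output_column):
    """Sum n-gram counts per observed label; rows with a missing label are skipped."""
    centroids = defaultdict(Counter)
    for row in rows:
        label = row.get(output_column)
        if label is not None:
            centroids[label].update(ngram_features(row[input_column]))
    return dict(centroids)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


def impute(rows, input_column, output_column):
    """Return copies of `rows` with missing labels filled by the nearest centroid."""
    centroids = fit_centroids(rows, input_column, output_column)
    imputed = []
    for row in rows:
        if row.get(output_column) is None:
            features = ngram_features(row[input_column])
            best = max(centroids, key=lambda label: cosine(features, centroids[label]))
            row = {**row, output_column: best}  # copy, so the input stays untouched
        imputed.append(row)
    return imputed


# Hypothetical toy table: `category` is missing in the last two rows.
rows = [
    {"title": "red cotton t-shirt", "category": "clothing"},
    {"title": "blue denim jeans", "category": "clothing"},
    {"title": "wireless bluetooth headphones", "category": "electronics"},
    {"title": "usb charging cable", "category": "electronics"},
    {"title": "green cotton shirt", "category": None},
    {"title": "bluetooth usb speaker", "category": None},
]
imputed = impute(rows, input_column="title", output_column="category")
```

The sketch trains only on rows where the target column is observed and predicts the rest, which mirrors the general supervised framing of imputation; a learned model with tuned hyperparameters would take the place of the centroid rule.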
