McFlow: Monte Carlo Flow Models for Data Imputation

We consider the topic of data imputation, a foundational task in machine learning that addresses issues with missing data. To that end, we propose MCFlow, a deep framework for imputation that leverages normalizing flow generative models and Monte Carlo sampling. We address the causality dilemma that arises when training models with incomplete data by introducing an iterative learning scheme which alternately updates the density estimate and the values of the missing entries in the training data. We provide extensive empirical validation of the effectiveness of the proposed method on standard multivariate and image datasets, and benchmark its performance against state-of-the-art alternatives. We demonstrate that MCFlow is superior to competing methods in terms of the quality of the imputed data, as well as with regards to its ability to preserve the semantic structure of the data.

[1]  W. K. Hastings,et al.  Monte Carlo Sampling Methods Using Markov Chains and Their Applications , 1970 .

[2]  Tian Han,et al.  Alternating Back-Propagation for Generator Network , 2016, AAAI.

[3]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[4]  Stef van Buuren,et al.  Flexible Imputation of Missing Data, Second Edition , 2018 .

[5]  Léon Bottou,et al.  Towards Principled Methods for Training Generative Adversarial Networks , 2017, ICLR.

[6]  Carl Tim Kelley,et al.  Iterative methods for optimization , 1999, Frontiers in applied mathematics.

[7]  Yoshua Bengio,et al.  NICE: Non-linear Independent Components Estimation , 2014, ICLR.

[8]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[9]  Yang Lu,et al.  Learning Generative ConvNets via Multi-grid Modeling and Sampling , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Nicole A. Lazar,et al.  Statistical Analysis With Missing Data , 2003, Technometrics.

[11]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[12]  Prafulla Dhariwal,et al.  Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[13]  G. Evans,et al.  Learning to Optimize , 2008 .

[14]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[15]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[16]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Samy Bengio,et al.  Density estimation using Real NVP , 2016, ICLR.

[18]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[19]  Gustavo Deco,et al.  Higher Order Statistical Decorrelation without Information Loss , 1994, NIPS.

[20]  Pieter Abbeel,et al.  PixelSNAIL: An Improved Autoregressive Generative Model , 2017, ICML.

[21]  Ruslan Salakhutdinov,et al.  Importance Weighted Autoencoders , 2015, ICLR.

[22]  G. C. Wei,et al.  A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms , 1990 .

[23]  Stef van Buuren,et al.  MICE: Multivariate Imputation by Chained Equations in R , 2011 .

[24]  Mihaela van der Schaar,et al.  GAIN: Missing Data Imputation using Generative Adversarial Nets , 2018, ICML.

[25]  Jared S. Murray,et al.  Multiple Imputation: A Review of Practical and Theoretical Findings , 2018, 1801.04058.

[26]  Marcin Andrychowicz,et al.  Learning to learn by gradient descent by gradient descent , 2016, NIPS.

[27]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[28]  Stef van Buuren,et al.  Flexible Imputation of Missing Data , 2012 .

[29]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[30]  Brendan J. Frey,et al.  Graphical Models for Machine Learning and Digital Communication , 1998 .

[31]  Jes Frellsen,et al.  missIWAE: Deep Generative Modelling and Imputation of Incomplete Data , 2018, ArXiv.

[32]  Peter Bühlmann,et al.  MissForest - non-parametric missing value imputation for mixed-type data , 2011, Bioinform..

[33]  Bo Jiang,et al.  MisGAN: Learning from Incomplete Data with Generative Adversarial Networks , 2019, ICLR.

[34]  Max Welling,et al.  Improved Variational Inference with Inverse Autoregressive Flow , 2016, NIPS 2016.

[35]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[36]  Pierre Comon,et al.  Handbook of Blind Source Separation: Independent Component Analysis and Applications , 2010 .

[37]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[38]  Masayoshi Takahashi,et al.  Statistical Inference in Missing Data by MCMC and Non-MCMC Multiple Imputation Algorithms: Assessing the Effects of Between-Imputation Iterations , 2017, Data Sci. J..

[39]  Marco Di Zio,et al.  Imputation through finite Gaussian mixture models , 2007, Comput. Stat. Data Anal..

[40]  Ronald C. Neath,et al.  On Convergence Properties of the Monte Carlo EM Algorithm , 2012, 1206.4768.

[41]  Julie Josse,et al.  Multiple imputation for continuous variables using a Bayesian principal component analysis† , 2014, 1401.5747.

[42]  Lovedeep Gondara,et al.  Multiple Imputation Using Deep Denoising Autoencoders , 2017, ArXiv.

[43]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[44]  Jes Frellsen,et al.  MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets , 2019, ICML.

[45]  Robert Tibshirani,et al.  Spectral Regularization Algorithms for Learning Large Incomplete Matrices , 2010, J. Mach. Learn. Res..

[46]  Aníbal R. Figueiras-Vidal,et al.  Pattern classification with missing data: a review , 2010, Neural Computing and Applications.

[47]  Alexander Hehmeyer,et al.  Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys , 2013 .

[48]  Zhongheng Zhang,et al.  Missing data imputation: focusing on single imputation. , 2016, Annals of translational medicine.