Missing Data Imputation using Optimal Transport

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.

[1]  Madeleine Udell,et al.  Why Are Big Data Matrices Approximately Low Rank? , 2017, SIAM J. Math. Data Sci..

[2]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[3]  Marco Cuturi,et al.  Computational Optimal Transport , 2019 .

[4]  David Gries,et al.  From the Editors of this special issue , 2001, Inf. Process. Lett..

[5]  Gabriel Peyré,et al.  Computational Optimal Transport , 2018, Found. Trends Mach. Learn..

[6]  Trevor J. Hastie,et al.  Matrix completion and low-rank SVD via fast alternating least squares , 2014, J. Mach. Learn. Res..

[7]  Ruslan Salakhutdinov,et al.  Importance Weighted Autoencoders , 2015, ICLR.

[8]  Stef van Buuren,et al.  Flexible Imputation of Missing Data , 2012 .

[9]  Nicole A. Lazar,et al.  Statistical Analysis With Missing Data , 2003, Technometrics.

[10]  Léon Bottou,et al.  Wasserstein Generative Adversarial Networks , 2017, ICML.

[11]  Han Zhang,et al.  Improving GANs Using Optimal Transport , 2018, ICLR.

[12]  P. Rigollet,et al.  Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming , 2019, Cell.

[13]  Jes Frellsen,et al.  MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets , 2019, ICML.

[14]  Tengyao Wang,et al.  High‐dimensional principal component analysis with heterogeneous missingness , 2019, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[15]  Jerome P. Reiter,et al.  Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence , 2014, 1410.0438.

[16]  Hossein Mobahi,et al.  Learning with a Wasserstein Loss , 2015, NIPS.

[17]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[18]  Julie Josse,et al.  Miss , 2020, Definitions.

[19]  D. Rubin INFERENCE AND MISSING DATA , 1975 .

[20]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[21]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[22]  J. Josse,et al.  missMDA: A Package for Handling Missing Values in Multivariate Data Analysis , 2016 .

[23]  Mihaela van der Schaar,et al.  GAIN: Missing Data Imputation using Generative Adversarial Nets , 2018, ICML.

[24]  Stef van Buuren,et al.  MICE: Multivariate Imputation by Chained Equations in R , 2011 .

[25]  Nicolas Courty,et al.  Learning with minibatch Wasserstein : asymptotic and gradient properties , 2020, AISTATS.

[26]  Judea Pearl,et al.  Graphical Models for Processing Missing Data , 2018, Journal of the American Statistical Association.

[27]  Arnaud Doucet,et al.  Fast Computation of Wasserstein Barycenters , 2013, ICML.

[28]  Gabriel Peyré,et al.  Learning Generative Models with Sinkhorn Divergences , 2017, AISTATS.

[29]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[30]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[31]  Dmitry Vetrov,et al.  Variational Autoencoder with Arbitrary Conditioning , 2018, ICLR.

[32]  Dan Jackson,et al.  What Is Meant by "Missing at Random"? , 2013, 1306.2812.

[33]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[34]  Alain Trouvé,et al.  Interpolating between Optimal Transport and MMD using Sinkhorn Divergences , 2018, AISTATS.

[35]  Peter Bühlmann,et al.  MissForest - non-parametric missing value imputation for mixed-type data , 2011, Bioinform..