Identifiable Generative Models for Missing Not at Random Data Imputation

Real-world datasets often contain missing values that arise from complex generative processes in which the cause of the missingness may not be fully observed. Such data are known as missing not at random (MNAR). However, many imputation methods ignore the missingness mechanism, which yields biased imputations when the data are MNAR. Although a few methods address the MNAR scenario, the identifiability of their models under MNAR is generally not guaranteed. That is, the model parameters cannot be uniquely determined even with infinitely many data samples, so the imputations produced by such models can still be biased. This issue is overlooked by many modern deep generative models in particular. In this work, we fill this gap by systematically analyzing the identifiability of generative models under MNAR. Furthermore, we propose a practical deep generative model that provides identifiability guarantees under mild assumptions for a wide range of MNAR mechanisms. Our method demonstrates a clear advantage on tasks involving both synthetic data and multiple real-world scenarios with MNAR data.
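The bias that arises when the missingness mechanism is ignored can be illustrated with a small simulation. The sketch below is not the method proposed in this work; it assumes a hypothetical self-masking mechanism in which larger values of a standard normal variable are more likely to be missing (a logistic function of the value itself, with an arbitrarily chosen slope), and then applies naive mean imputation. The function name `simulate_self_masking` and all parameter values are illustrative assumptions.

```python
import numpy as np


def simulate_self_masking(n=100_000, slope=2.0, seed=0):
    """Draw x ~ N(0, 1) and hide each value with probability sigmoid(slope * x).

    The logistic self-masking mechanism is an assumption made for this
    illustration only; it is one simple instance of MNAR missingness.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=n)
    p_missing = 1.0 / (1.0 + np.exp(-slope * x))
    observed = rng.random(n) >= p_missing  # True where the value is observed
    return x, observed


x, observed = simulate_self_masking()

true_mean = x.mean()                 # mean of the complete data
observed_mean = x[observed].mean()   # mean of the observed values only

# Naive mean imputation: fill every missing entry with the observed mean.
x_imputed = np.where(observed, x, observed_mean)

print(f"true mean:          {true_mean:.3f}")
print(f"observed-only mean: {observed_mean:.3f}")
print(f"mean after imputing: {x_imputed.mean():.3f}")
```

Because large values are preferentially hidden, the observed-only mean falls well below the true mean, and mean imputation carries that bias into the completed data regardless of sample size.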
