Graphical Causal Models and Imputing Missing Data: A Preliminary Study

Real-world datasets often contain many missing values, for a variety of reasons. This is usually a problem because many learning algorithms require complete datasets. In some cases, constraints in the real-world problem make it difficult to observe all data continuously. In this paper, we investigate whether graphical causal models can be used to impute missing values and to derive additional information about the uncertainty of the imputed values. Our goal is to use the information in a complete dataset, captured in the form of a graphical causal model, to impute missing values in an incomplete dataset. This approach assumes that both datasets share the same data-generating process. Furthermore, we calculate the probability that each missing value belongs to a specified percentile. We present a preliminary study of the proposed method on synthetic data, in which we can control the causal relations and the missing values.
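The idea can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes a known two-variable causal graph X -> Y with a linear Gaussian mechanism, fits that mechanism on a complete synthetic dataset, and then imputes a fully unobserved Y in a second dataset drawn from the same process. The "specified percentile" is read here as a threshold taken from the complete data, and the reported probability is that of the missing value falling at or below it; this reading, the function names, and all parameter values are illustrative assumptions.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_linear_mechanism(x, y):
    """Fit y = a + b*x + noise by least squares on the complete data.

    Assumes the causal graph X -> Y is already known (hypothetical setup).
    Returns the intercept, slope, and residual standard deviation.
    """
    b, a = np.polyfit(x, y, 1)  # coefficients come highest degree first
    residuals = y - (a + b * x)
    sigma = residuals.std(ddof=2)  # two fitted parameters
    return a, b, sigma


def impute_with_uncertainty(x_obs, a, b, sigma, threshold):
    """Impute missing Y from observed X and quantify uncertainty.

    Returns the conditional mean of Y given X, and the probability that
    the missing value lies at or below `threshold` under the fitted
    Gaussian noise model.
    """
    mean = a + b * x_obs
    prob_below = np.array([normal_cdf((threshold - m) / sigma) for m in mean])
    return mean, prob_below


rng = np.random.default_rng(0)

# Complete dataset: both X and Y are observed.
x_complete = rng.normal(size=2000)
y_complete = 2.0 * x_complete + rng.normal(scale=0.5, size=2000)
a, b, sigma = fit_linear_mechanism(x_complete, y_complete)

# Threshold: the 90th percentile of Y in the complete dataset.
threshold = np.percentile(y_complete, 90)

# Incomplete dataset from the same process: Y is missing, X is observed.
x_incomplete = rng.normal(size=5)
y_imputed, p_below = impute_with_uncertainty(
    x_incomplete, a, b, sigma, threshold
)

for xv, yv, pv in zip(x_incomplete, y_imputed, p_below):
    print(f"x={xv:+.2f}  imputed y={yv:+.2f}  P(y <= p90)={pv:.3f}")
```

Because the data are synthetic, the fitted slope and noise level can be checked against the values used to generate them, which mirrors the controlled setting described above. A fuller treatment would first learn the graph from the complete data and would need to handle other missingness patterns.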
