A new multivariate imputation method based on Bayesian networks

Dealing with incomplete data is a pervasive problem in statistical surveys, and Bayesian networks have recently been used for missing data imputation. In this research, we propose a new methodology for the multivariate imputation of missing data using discrete Bayesian networks and conditional Gaussian Bayesian networks. We present results from imputing missing values in a coronary artery disease data set and a milk composition data set, together with a simulation study based on the cancer-neapolitan network, to compare the performance of three Bayesian network-based imputation methods with that of multivariate imputation by chained equations (MICE) and the classical hot-deck imputation method. To assess the effect of the structure learning algorithm on the performance of the Bayesian network-based methods, two algorithms were applied: the Peter-Clark (PC) algorithm and greedy search-and-score. The three Bayesian network-based methods are: first, the method introduced by Di Zio et al. [Bayesian networks for imputation, J. R. Stat. Soc. Ser. A 167 (2004), 309–322], in which each missing item of a variable is imputed using the information in the parents of that variable; second, the method of Di Zio et al. [Multivariate techniques for imputation based on Bayesian networks, Neural Netw. World 15 (2005), 303–310], which uses the information in the Markov blanket of the variable to be imputed; and finally, our newly proposed method, which uses all available information on the variables of interest, including the Markov blanket and hence the parent set, to impute a missing item. The results indicate the high quality of the proposed method, especially with high percentages of missing data and more densely connected networks. The new method is also shown to be more efficient than MICE for small sample sizes with high missing rates.
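To make the difference between the three conditioning sets concrete, the following minimal Python sketch computes, for one node of a small directed acyclic graph, its parents, its Markov blanket (parents, children, and the children's other parents), and the set of all other variables. The graph `PARENTS`, the node names, and the helper functions are hypothetical illustrations, not taken from the paper, and treating "all other variables" as the conditioning set of the third method is only one reading of the abstract.

```python
from typing import Dict, Set

# Hypothetical toy DAG (an assumption for illustration, not the paper's network),
# stored as a mapping from each node to the set of its parents.
PARENTS: Dict[str, Set[str]] = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
    "E": {"C"},
}


def children(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every node that lists `node` among its parents."""
    return {child for child, pa in parents.items() if node in pa}


def markov_blanket(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Parents, children, and the children's other parents of `node`."""
    kids = children(node, parents)
    spouses: Set[str] = set()
    for kid in kids:
        spouses |= parents[kid]
    blanket = parents[node] | kids | spouses
    blanket.discard(node)  # a node is never part of its own blanket
    return blanket


def conditioning_sets(node: str, parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Candidate conditioning sets for imputing `node` (illustrative only)."""
    return {
        "parents": set(parents[node]),
        "markov_blanket": markov_blanket(node, parents),
        "all_other_variables": set(parents) - {node},
    }


if __name__ == "__main__":
    for name, members in conditioning_sets("B", PARENTS).items():
        print(name, sorted(members))
```

In this toy graph the parent set of `B` is contained in its Markov blanket, which in turn is contained in the set of all other variables. The sets differ in practice because some Markov blanket variables may themselves be missing for a given record, in which case variables outside the blanket can still carry information.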

[1]  Judea Pearl,et al.  Chapter 2 – BAYESIAN INFERENCE , 1988 .

[2]  J. Pearl.  Causality: Models, Reasoning and Inference , 2000 .

[3]  Wray L. Buntine Operations for Learning with Graphical Models , 1994, J. Artif. Intell. Res..

[4]  Yang Yuan,et al.  Multiple Imputation Using SAS Software , 2011 .

[5]  Marek J. Druzdzel,et al.  Robust Independence Testing for Constraint-Based Learning of Causal Structure , 2002, UAI.

[6]  David Heckerman,et al.  A Tutorial on Learning with Bayesian Networks , 1999, Innovations in Bayesian Networks.

[7]  David Maxwell Chickering,et al.  Optimal Structure Identification With Greedy Search , 2003, J. Mach. Learn. Res..

[8]  Stef van Buuren,et al.  MICE: Multivariate Imputation by Chained Equations in R , 2011 .

[9]  Robert G. Cowell,et al.  Conditions Under Which Conditional Independence and Scoring Methods Lead to Identical Selection of Bayesian Network Models , 2001, UAI.

[10]  David J. Spiegelhalter,et al.  Probabilistic Networks and Expert Systems , 1999, Information Science and Statistics.

[11]  David Maxwell Chickering,et al.  Learning Bayesian Networks: The Combination of Knowledge and Statistical Data , 1994, Machine Learning.

[12]  S. Lauritzen.  The EM algorithm for graphical association models with missing data , 1995 .

[13]  Gregory F. Cooper,et al.  A Bayesian method for the induction of probabilistic networks from data , 1992, Machine Learning.

[14]  Franz von Kutschera,et al.  Causation , 1993, J. Philos. Log..

[15]  Finn V. Jensen,et al.  Bayesian Networks and Decision Graphs , 2001, Statistics for Engineering and Information Science.

[16]  Michael I. Jordan.  Graphical Models , 1998 .

[17]  Gregory F. Cooper,et al.  NESTOR: A Computer-Based Medical Diagnostic Aid That Integrates Causal and Probabilistic Knowledge. , 1984 .

[18]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[19]  David J. Spiegelhalter,et al.  Local computations with probabilities on graphical structures and their application to expert systems , 1990 .

[20]  M. Scanu,et al.  Bayesian networks for imputation , 2004 .

[21]  William E. Winkler,et al.  Bayesian Networks Representations, Generalized Imputation, and Synthetic Micro-data Satisfying Analytic Constraints , 2002 .

[22]  John Van Hoewyk,et al.  A multivariate technique for multiply imputing missing values using a sequence of regression models , 2001 .

[23]  Weiru Liu,et al.  Learning belief networks from data: an information theory based approach , 1997, CIKM '97.

[24]  Paola Vicard,et al.  Multivariate techniques for imputation based on Bayesian networks , 2005 .

[25]  D. Rubin.  Inference and Missing Data , 1975 .

[26]  Charlotte Lauridsen,et al.  Lactational dietary fat levels and sources influence milk composition and performance of sows and their progeny , 2004 .

[27]  Jie Cheng,et al.  Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory , 1999 .