Bayesian network data imputation with application to survival tree analysis

Retrospective clinical datasets are often characterized by a relatively small sample size and a large amount of missing data. A common way of handling the missingness is to discard patients with missing covariates from the analysis, which further reduces the sample size. Alternatively, if the mechanism that generated the missing data allows it, incomplete data can be imputed on the basis of the observed data, which avoids reducing the sample size and allows methods that require complete data to be applied afterwards. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. We study the problem of missing data treatment in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem with surrogate splits, that is, splitting rules that use other variables and yield results similar to those of the original splits. Instead, our methodology models the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, so that the survival tree can be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation-maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation, and the resulting imputation improved the stratification estimated by the survival tree (especially compared with using surrogate splits).
Retrospective clinical datasets often have a small sample size and many missing data. We use Bayesian networks to impute missing data, enhancing survival tree analysis. The Bayesian network is learned from incomplete data and used for the imputation. Our method generally achieved more accurate predictions than widely used approaches.
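
The idea of learning a Bayesian network from incomplete data and then using it for imputation can be illustrated with a deliberately small sketch. It is not the authors' procedure: the structure is fixed in advance to a two-node network X -> Y instead of being learned by structural EM, only the parameters are estimated by EM, and each missing entry is filled with its most probable value given the observed entries. The toy records, the function names (`fit_em`, `impute`), and the pseudo-count are all assumptions made for illustration.

```python
from itertools import product

# Toy binary records (x, y); None marks a missing entry. Hypothetical data.
rows = ([(0, 0)] * 8 + [(1, 1)] * 8 + [(0, 1), (1, 0)]
        + [(0, None), (1, None), (None, 1), (None, 0)])


def completions(row):
    """All full assignments (x, y) consistent with the observed entries."""
    domains = [(v,) if v is not None else (0, 1) for v in row]
    return list(product(*domains))


def joint(params, x, y):
    """P(X=x, Y=y) under the fixed structure X -> Y."""
    p_x, p_y_given_x = params
    return p_x[x] * p_y_given_x[x][y]


def posterior(params, row):
    """Posterior over the completions of a row, given its observed entries."""
    cands = completions(row)
    weights = [joint(params, x, y) for x, y in cands]
    total = sum(weights)
    return [(c, w / total) for c, w in zip(cands, weights)]


def fit_em(rows, n_iter=50, pseudo=1.0):
    """Parameter EM for X -> Y; pseudo is an assumed smoothing count."""
    params = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
    for _ in range(n_iter):
        # E-step: expected counts of every (x, y) configuration.
        counts = [[pseudo, pseudo], [pseudo, pseudo]]
        for row in rows:
            for (x, y), w in posterior(params, row):
                counts[x][y] += w
        # M-step: re-estimate the probability tables from expected counts.
        total = sum(map(sum, counts))
        p_x = [sum(counts[x]) / total for x in (0, 1)]
        p_y_given_x = [[counts[x][y] / sum(counts[x]) for y in (0, 1)]
                       for x in (0, 1)]
        params = (p_x, p_y_given_x)
    return params


def impute(params, rows):
    """Fill each row with its most probable completion."""
    return [max(posterior(params, row), key=lambda cw: cw[1])[0]
            for row in rows]


params = fit_em(rows)
completed = impute(params, rows)
print(completed[-4:])
```

Fully observed rows pass through unchanged, so a downstream method that needs complete data (a survival tree, in the setting of the abstract) could then be run on `completed`.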
