Variational Autoencoder Modular Bayesian Networks (VAMBN) for Simulation of Heterogeneous Clinical Study Data

In the era of Big Data, one of the major obstacles to progress in biomedical research is the existence of data “silos”: legal and ethical constraints often prohibit sharing sensitive patient data from clinical studies across institutions. While federated machine learning now allows models to be built from scattered data, there remains a need to investigate, mine and understand clinical data that cannot be accessed directly. Simulating sufficiently realistic virtual patients could fill this gap. In this work we propose a new machine learning approach, Variational Autoencoder Modular Bayesian Networks (VAMBN), to learn a generative model of longitudinal clinical study data. VAMBN addresses typical key aspects of such data, namely limited sample size coupled with comparably many variables of different numerical scales and statistical properties, and many missing values. We show that VAMBN can simulate virtual patients in a sufficiently realistic manner while providing theoretical guarantees on data privacy. In addition, VAMBN allows for simulating counterfactual scenarios. Hence, VAMBN could facilitate data sharing as well as the design of clinical trials.
