Realistic simulation of virtual multi-scale, multi-modal patient trajectories using Bayesian networks and sparse auto-encoders

Translational research in many disease areas requires a longitudinal understanding of disease development and progression across all biologically relevant scales. Several studies of this kind are now available, but compiling a comprehensive picture of a specific disease requires analyzing and comparing multiple studies. Many clinical studies are now conducted in the context of drug development in pharmaceutical research, yet legal and ethical constraints typically prevent the sharing of sensitive patient data. As a consequence, data “silos” exist that slow scientific progress in translational research. In this paper, we propose the idea of a virtual cohort (VC) to address this limitation. Our key idea is to describe a longitudinal patient cohort with a generative statistical model, namely a modular Bayesian network in which individual modules are represented as sparse autoencoder networks. We show that such a model can simulate subjects that are highly similar to real ones. Our approach allows arbitrary multi-scale, multi-modal data to be incorporated without specific distributional assumptions. Moreover, we demonstrate that interventions (e.g., a treatment) can be simulated in the VC. Overall, our approach opens the possibility of building sufficiently realistic VCs for multiple disease areas in the future.
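
As a minimal illustration of what simulating virtual subjects and interventions from a Bayesian network can look like, the sketch below draws subjects from a hypothetical three-node network by sampling each node given its parents, and mimics an intervention by fixing a node's value instead of sampling it. The network structure, the node names, and all probabilities are invented for illustration and are not taken from the paper; the model described above uses sparse autoencoder modules rather than hand-set probability tables.

```python
import random

# Hypothetical toy network: treatment -> biomarker -> progression (all binary).
# All probabilities are illustrative assumptions, not estimates from any study.
P_TREATMENT = 0.5                       # P(treatment = 1)
P_BIOMARKER_HIGH = {0: 0.7, 1: 0.3}     # P(biomarker = 1 | treatment)
P_PROGRESSION = {0: 0.2, 1: 0.6}        # P(progression = 1 | biomarker)


def draw(rng, name, prob_one, do):
    """Return the forced value if `name` is intervened on, else sample it."""
    if name in do:
        return do[name]
    return int(rng.random() < prob_one)


def sample_subject(rng, do=None):
    """Sample one virtual subject, visiting nodes in topological order."""
    do = do or {}
    treatment = draw(rng, "treatment", P_TREATMENT, do)
    biomarker = draw(rng, "biomarker", P_BIOMARKER_HIGH[treatment], do)
    progression = draw(rng, "progression", P_PROGRESSION[biomarker], do)
    return {
        "treatment": treatment,
        "biomarker": biomarker,
        "progression": progression,
    }


def simulate_cohort(n, seed=0, do=None):
    """Simulate a virtual cohort of n subjects, optionally under an intervention."""
    rng = random.Random(seed)
    return [sample_subject(rng, do) for _ in range(n)]


def progression_rate(cohort):
    """Fraction of subjects in the cohort with progression = 1."""
    return sum(s["progression"] for s in cohort) / len(cohort)


if __name__ == "__main__":
    treated = simulate_cohort(20000, seed=1, do={"treatment": 1})
    untreated = simulate_cohort(20000, seed=1, do={"treatment": 0})
    print("progression rate, everyone treated: ", progression_rate(treated))
    print("progression rate, no one treated:   ", progression_rate(untreated))
```

Comparing the two simulated cohorts gives a rough view of the intervention's effect in this toy setting: fixing `treatment` replaces its sampling step while every downstream node is still drawn from its conditional distribution.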
