Effect of incorporating metadata into the generation of synthetic time series in a healthcare context

Synthetic data is becoming a key means of managing the legal and regulatory aspects of biomedical research involving personal and clinical data. Because no matches are expected between artificial instances and real samples or subjects, external researchers performing secondary analyses could benefit significantly from unrestricted access to data that does not compromise the privacy of real subjects. In this context, one of the main objectives of the H2020 VITALISE project is to develop a platform that provides those external researchers with synthetic data generated from real data collected in Living Labs. Although some synthetic data generation models specific to time series exist, only a few consider metadata (e.g., patient demographics) as part of the time series generation process itself. The objective of this research is therefore to perform a comparative assessment of two synthetic data generation models that use and process subject metadata differently: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and DoppelGANger (DGAN). To keep the analysis from depending on a single dataset, we selected two healthcare-related longitudinal datasets: (1) Treadmill Maximal Effort Test (TMET) measurements from the University of Málaga; and (2) a hypotension subset derived from the MIMIC-III v1.4 database. After generating the synthetic data, we assessed three pivotal aspects: resemblance to the original data, utility, and level of privacy. The main conclusion is that using metadata as context variables, and the methodology used to take them into account, proved significant and valuable, with the DGAN model offering better results overall. A more extensive time-series-specific evaluation is left as the main avenue for future research.
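The expectation that no synthetic instance matches a real sample can be illustrated with a simple nearest-record check. The sketch below is only an illustration under stated assumptions: the function names, the tolerance, and the toy heart-rate-like values are hypothetical, and the Euclidean distance-to-closest-record measure is a common generic choice, not necessarily the privacy metric used in this study.

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic series, return the Euclidean distance to its nearest real series.

    Assumes all series have the same length (hypothetical, fixed-length toy setting).
    """
    return [min(math.dist(s, r) for r in real) for s in synthetic]

def exact_match_rate(synthetic, real, tol=1e-9):
    """Fraction of synthetic series that coincide with some real series (within tol)."""
    dcr = distance_to_closest_record(synthetic, real)
    return sum(d <= tol for d in dcr) / len(dcr)

# Toy values, invented for illustration only (not taken from TMET or MIMIC-III).
real = [[60.0, 95.0, 130.0], [70.0, 110.0, 150.0]]
synthetic = [[60.0, 95.0, 130.0], [65.0, 100.0, 140.0]]

dcr = distance_to_closest_record(synthetic, real)
rate = exact_match_rate(synthetic, real)
print(dcr, rate)
```

In this toy case the first synthetic series copies a real one, so its distance is zero and it would be flagged as a match; a generator that preserves privacy should keep the match rate at or near zero.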
