Fidelity and Privacy of Synthetic Medical Data

The digitization of medical records ushered in a new era of big data in clinical science, and with it the possibility that data could be shared to multiply insights beyond what investigators could abstract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent as scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use of big data has been tempered by an entirely appropriate concern for patient autonomy and privacy: because private or confidential information about an individual can, in practice, be extracted from shared records, significant infrastructure and data governance must be established before data can be shared. Although HIPAA sanctioned de-identification as an approved mechanism for data sharing, linkage attacks have been identified as a major vulnerability. A variety of mechanisms have been established to avoid leaking private information, such as suppressing or abstracting fields, strictly limiting the amount of information that can be shared, or employing mathematical techniques such as differential privacy. Another approach, which we focus on here, is creating synthetic data that mimics the underlying real data. For synthetic data to serve as a useful mechanism in support of medical innovation and as a proxy for real-world evidence, one must demonstrate two properties of the synthetic dataset: (1) any analysis on the real data can be matched by the same analysis on the synthetic data (statistical fidelity), and (2) the synthetic data must preserve privacy, with minimal risk of re-identification (privacy guarantee). In this paper we propose a framework for quantifying the statistical fidelity and privacy-preservation properties of synthetic datasets and demonstrate these metrics for synthetic data generated by Syntegra technology.
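The two properties above can be made concrete with simple stand-in metrics. The sketch below is purely illustrative and is not Syntegra's actual methodology: it scores statistical fidelity for a single numeric column with a two-sample Kolmogorov-Smirnov statistic (maximum gap between empirical CDFs), and scores one narrow aspect of privacy risk with an exact-copy rate, the fraction of synthetic values that are verbatim copies of real values. All data and thresholds here are hypothetical.

```python
import bisect
import random

random.seed(0)

def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, v):
        # Fraction of values in xs that are <= v.
        return bisect.bisect_right(xs, v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in set(a) | set(b))

def exact_copy_rate(synthetic, real):
    """Fraction of synthetic values that exactly reproduce a real value
    (a crude memorization check, not a full re-identification analysis)."""
    real_set = set(real)
    return sum(1 for s in synthetic if s in real_set) / len(synthetic)

# Toy data standing in for one column of a real vs. synthetic dataset.
real = [random.gauss(0, 1) for _ in range(1000)]
synthetic = [random.gauss(0, 1) for _ in range(1000)]  # faithful generator
shifted = [random.gauss(2, 1) for _ in range(1000)]    # poor generator

print("KS real vs synthetic:", ks_statistic(real, synthetic))  # small gap
print("KS real vs shifted:  ", ks_statistic(real, shifted))    # large gap
copy_rate = exact_copy_rate(synthetic, real)
print("exact-copy rate:", copy_rate)
```

A real evaluation framework would aggregate such per-column scores across all variables, compare joint (not just marginal) distributions, and replace the exact-copy check with proper re-identification risk measures; this sketch only shows the shape of the comparison.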
