Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Background: There has been growing interest in data synthesis as a way to share data for secondary analysis. However, a comprehensive privacy risk model for fully synthetic data is still needed: if the generative model has been overfit, it is possible to identify individuals from the synthetic data and learn something new about them.

Objective: This study aims to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.

Results: The meaningful identity disclosure risk for both synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively.

Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
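To make the idea of sequential synthesis mentioned in the Methods more concrete, the following is a minimal Python sketch, not the study's implementation. It assumes purely categorical records and replaces the fitted decision tree with a much simpler stand-in: each variable is sampled from the real values observed among records that share the previously synthesized variable's value. The function name `synthesize_sequentially`, the column layout, and the example rows are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def synthesize_sequentially(real_rows, n_synthetic, seed=0):
    """Toy sequential synthesis for categorical rows (tuples of equal length).

    Column 0 is sampled from its observed marginal distribution. Each later
    column j is sampled from the real values of column j among rows that
    share the synthetic row's value in column j-1. This lookup is a
    simplified stand-in (an assumption of this sketch) for the leaf of a
    decision tree fitted on the earlier columns.
    """
    rng = random.Random(seed)
    n_cols = len(real_rows[0])

    # conditional[j][previous_value] -> observed values of column j
    conditional = [defaultdict(list) for _ in range(n_cols)]
    for row in real_rows:
        for j in range(1, n_cols):
            conditional[j][row[j - 1]].append(row[j])

    first_column = [row[0] for row in real_rows]
    synthetic = []
    for _ in range(n_synthetic):
        new_row = [rng.choice(first_column)]
        for j in range(1, n_cols):
            # The previous value was itself drawn from real data, so at
            # least one real row matches and the candidate list is non-empty.
            new_row.append(rng.choice(conditional[j][new_row[-1]]))
        synthetic.append(tuple(new_row))
    return synthetic


# Hypothetical records: (age_group, sex, diagnosis_group)
real_rows = [
    ("18-39", "F", "respiratory"),
    ("18-39", "M", "injury"),
    ("40-64", "F", "cardiac"),
    ("40-64", "M", "cardiac"),
    ("65+", "F", "respiratory"),
    ("65+", "M", "cardiac"),
]

synthetic = synthesize_sequentially(real_rows, n_synthetic=50)
```

Because every synthetic value is drawn from real values, a synthesizer that conditions too finely can reproduce real records outright, which is the overfitting concern the Background raises. A tree-based synthesizer limits this by constraining leaf sizes; this sketch has no such safeguard.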
