Privacy-preserving data sharing infrastructures for medical research: systematization and comparison

Background
Data sharing is considered a crucial part of modern medical research. Despite its advantages, it often faces obstacles, especially regarding data privacy. As a result, various approaches and infrastructures have been developed that aim to ensure that patients and research participants remain anonymous when data is shared. However, privacy protection typically comes at a cost, e.g. restrictions on the types of analyses that can be performed on shared data. What is lacking is a systematization that makes the trade-offs made by different approaches transparent. The aim of the work described in this paper was to develop a systematization of the degree of privacy protection provided and the trade-offs made by different data sharing methods. Based on this systematization, we categorized popular data sharing approaches and identified research gaps by analyzing combinations of promising properties and features that are not yet supported by existing approaches.

Methods
The systematization consists of six axes. Three axes relate to privacy protection and were adopted from the popular Five Safes Framework: (1) safe data, addressing privacy at the input level, (2) safe settings, addressing privacy during shared processing, and (3) safe outputs, addressing privacy protection of analysis results. Three additional axes address the usefulness of approaches: (4) support for de-duplication, to enable the reconciliation of data belonging to the same individuals, (5) flexibility, to be able to adapt to different data analysis requirements, and (6) scalability, to maintain performance with increasing complexity of shared data or common analysis processes.

Results
Using the systematization, we identified three categories of approaches: distributed data analyses, which exchange anonymous aggregated data; secure multi-party computation protocols, which exchange encrypted data; and data enclaves, which store pooled individual-level data in secure environments where it can be accessed for analysis. We identified important research gaps, including a lack of approaches that enable the de-duplication of horizontally distributed data or that provide a high degree of flexibility.

Conclusions
There are fundamental differences between data sharing approaches, and several gaps in their functionality may be interesting to investigate in future work. Our systematization can make the properties of privacy-preserving data sharing infrastructures more transparent and give decision makers and regulatory authorities a better understanding of the trade-offs involved.
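To make the difference between the three categories more concrete, the following is a minimal Python sketch of how the same count query might be answered in each setting. It is an illustration under stated assumptions, not a description of any infrastructure discussed in the paper: the function names, the record layout, and the modulus are hypothetical, and additive secret sharing is used only as a simple stand-in for the protected data exchanged in secure multi-party computation protocols.

```python
import secrets

# Hypothetical modulus for additive secret sharing (an assumption of this sketch).
MODULUS = 2**61 - 1


def distributed_count(site_records, predicate):
    """Distributed analysis: each site releases only its local aggregate count."""
    local_counts = [sum(1 for r in records if predicate(r)) for records in site_records]
    return sum(local_counts)


def make_shares(value, n_parties):
    """Split a non-negative integer into n additive shares modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(local_values):
    """Secret-sharing-based sum: no party sees another party's local value."""
    n = len(local_values)
    # all_shares[i][j] is the share that party i sends to party j.
    all_shares = [make_shares(v, n) for v in local_values]
    # Each party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Combining the partial sums reveals the total, but not the individual inputs.
    return sum(partial_sums) % MODULUS


def enclave_count(site_records, predicate):
    """Data enclave: individual-level records are pooled inside a secure environment."""
    pooled = [r for records in site_records for r in records]
    return sum(1 for r in pooled if predicate(r))
```

For example, with three hypothetical sites holding records such as `{"age": 70}`, all three functions return the same total for the predicate `age >= 65`; they differ in what leaves each site: a plain aggregate, random-looking shares, or the individual-level records themselves.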
