Bringing the Algorithms to the Data - Secure Distributed Medical Analytics using the Personal Health Train (PHT-meDIC)

The need for data privacy and security, enforced through increasingly strict data protection regulations, makes it difficult to use healthcare data for machine learning. In particular, transferring data between hospitals is often not permissible, so cross-site pooling of data is not an option. The Personal Health Train (PHT) concept proposed within the GO-FAIR initiative follows an "algorithm to the data" paradigm, which allows distributed data to be accessed for analysis without transferring any sensitive data. We present PHT-meDIC, an open-source implementation of the PHT concept that is deployed in production. Containerization allows us to easily deploy even complex data analysis pipelines (e.g., genomics, image analysis) across multiple sites in a secure and scalable manner. We discuss the underlying technological concepts, security models, and governance processes. The implementation has been successfully applied to distributed analyses of large-scale data, including applications of deep neural networks to medical image data.
