De-identification of Clinical Text for Secondary Use: Research Issues

Privacy is challenged both by advances in AI-related technologies and by recently introduced legal regulations. Privacy has been studied extensively within the privacy community, but that work has largely focused on methods for protecting and assessing the privacy of structured data. Research aiming to protect the privacy of patients in clinical text has primarily referred to US law and has relied on automatically recognising predetermined direct and indirect identifiers. This article discusses the challenges of re-using unstructured clinical data, in particular clinical text. It focuses on ambiguous and vague terminology; how different legislation affects the requirements for de-identification; differences between methods for unstructured and structured data; the impact of approaches based on named entity recognition and on replacing sensitive data with surrogates; and the lack of measures for usability and re-identification risk.
