De-identifying Spanish medical texts - Named Entity Recognition applied to radiology reports

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information about both patients and medical staff. Although several anonymization strategies exist for English, they are language-dependent and do not transfer directly to other languages. Here, we introduce a named entity recognition strategy for Spanish medical texts that can be adapted to other languages. We tested four neural networks on our radiology reports dataset, achieving a recall of 97.18% on the identifying entities. In addition, we developed a randomization algorithm that substitutes the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were also evaluated on the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it can be easily extended to other languages and to other medical texts, such as electronic health records.
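
The substitution step described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the category names, the surrogate pools in `SURROGATES`, the helper names (`span`, `randomize_entities`, `recall`) and the sample sentence are all hypothetical, and the entity spans are assumed to come from some upstream named entity recognition model. The sketch only shows the general idea of replacing each detected entity with a random one from the same category, plus how span-level recall can be computed.

```python
import random

# Hypothetical surrogate pools, one per entity category (illustrative values only).
SURROGATES = {
    "NAME": ["María López", "Juan Pérez", "Ana Torres"],
    "HOSPITAL": ["Hospital San Rafael", "Clínica del Norte"],
    "DATE": ["12/03/2015", "27/11/2018"],
}


def span(text, phrase, category):
    """Build a (start, end, category) character span for the first match of `phrase`."""
    start = text.index(phrase)
    return (start, start + len(phrase), category)


def randomize_entities(text, entities, rng):
    """Replace each detected span with a random surrogate of the same category.

    `entities` is a list of (start, end, category) spans, assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, category in sorted(entities):
        pieces.append(text[cursor:start])  # untouched text before the entity
        original = text[start:end]
        # Avoid drawing the original value back by chance.
        candidates = [s for s in SURROGATES[category] if s != original]
        pieces.append(rng.choice(candidates))
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last entity
    return "".join(pieces)


def recall(gold, predicted):
    """Fraction of gold entity spans that were detected (exact span match)."""
    gold = set(gold)
    return len(gold & set(predicted)) / len(gold) if gold else 1.0


report = "Paciente Pedro Gómez visto en Hospital La Fe el 01/02/2017."
detected = [
    span(report, "Pedro Gómez", "NAME"),
    span(report, "Hospital La Fe", "HOSPITAL"),
    span(report, "01/02/2017", "DATE"),
]
synthetic = randomize_entities(report, detected, random.Random(0))
print(synthetic)
```

Because every replacement is drawn from the same category as the original, the synthetic report keeps its structure and readability, while any entity the recognizer missed becomes harder to tell apart from the surrogates around it.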
