Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results

There is increasing interest in exploiting the content of electronic health records by means of natural language processing and text-mining technologies, as they can yield resources that improve patient health and safety, aid clinical decision making, and facilitate drug repurposing or precision medicine. To share, redistribute and make clinical narratives accessible for text-mining research, it is key to fulfill legal conditions and address restrictions related to data protection and patient privacy. Thus, clinical records cannot be shared directly "as is". A necessary precondition for accessing clinical records outside of hospitals is their de-identification, that is, the exhaustive removal or replacement of all mentions of privacy-related protected health information. Providing a proper evaluation scenario for automatic anonymization tools is key for the approval of data redistribution. The construction of manually de-identified medical records is currently the main rate- and cost-limiting step for secondary-use applications of clinical data. This paper summarizes the settings, data and results of the first shared track on anonymization of medical documents in Spanish, the MEDDOCAN (Medical Document Anonymization) track. The track relied on a carefully constructed synthetic corpus of clinical case documents, the MEDDOCAN corpus, annotated following guidelines for sensitive data based on an analysis of the EU General Data Protection Regulation. A total of 18 teams (out of 51 registrations) submitted 63 runs for the first sub-track and 61 for the second sub-track. The top-scoring systems were based on sophisticated deep learning approaches, representing strategies that can significantly reduce the time and costs associated with accessing textual data containing privacy-related sensitive information. The results of this track might help lower the clinical data access hurdle for Spanish language technology developers, and also show potential for similar settings using data in other languages or from different domains.
