De-Identification of German Medical Admission Notes

Medical texts are a vast resource for medical and computational research. In contrast to newswire or Wikipedia texts, medical texts need to be de-identified before they can be made accessible to the wider NLP research community. We created a prototype for German medical text de-identification and named entity recognition using a three-step approach. First, we used well-known rule-based models based on regular expressions and gazetteers. Second, we used a spelling variant detector based on Levenshtein distance, exploiting the fact that the medical texts contain semi-structured headers that include sensitive personal data. Third, we trained a named entity recognition model on out-of-domain data to add statistical capabilities to our prototype. Compared with a baseline based on regular expressions and gazetteers, the prototype improved the F2-score for de-identification from 78% to 85%. Our prototype is a first step toward further research on German medical text de-identification, and it shows that spelling variant detection and statistical models trained on out-of-domain data can improve de-identification performance significantly.
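To illustrate the second step and the evaluation metric, the following is a minimal Python sketch, not the prototype's actual implementation. It assumes that names have already been extracted from the semi-structured header and then flags tokens in the free text that lie within a small Levenshtein distance of one of them. The function names (`levenshtein`, `find_variants`, `f_beta`), the distance threshold of 1, and the minimum token length of 4 are all hypothetical choices made for this example. The last function computes the standard F-beta score, which with beta = 2 weights recall more heavily than precision.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_variants(text, header_terms, max_distance=1, min_length=4):
    """Return (start, end, token) for tokens close to any header term.

    max_distance and min_length are illustrative assumptions, not values
    taken from the paper.
    """
    targets = [term.lower() for term in header_terms]
    hits = []
    for match in re.finditer(r"\w+", text):
        token = match.group().lower()
        if len(token) < min_length:
            continue  # very short tokens would match too many terms
        if any(levenshtein(token, t) <= max_distance for t in targets):
            hits.append((match.start(), match.end(), match.group()))
    return hits


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta=2 gives the F2-score."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with the header name "Müller", `find_variants("Frau Müler wurde aufgenommen.", ["Müller"])` flags the misspelled "Müler" as a candidate for masking. A recall-weighted metric such as F2 fits this task because a missed identifier is usually more costly than an over-masked token.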
