Paper information - De-Identification of German Medical Admission Notes

De-Identification of German Medical Admission Notes

Medical texts are a vast resource for medical and computational research. In contrast to newswire or Wikipedia texts, medical texts need to be de-identified before they can be made accessible to the wider NLP research community. We created a prototype for German medical text de-identification and named entity recognition using a three-step approach. First, we used well-known rule-based models based on regular expressions and gazetteers. Second, we used a spelling variant detector based on Levenshtein distance, exploiting the fact that the medical texts contain semi-structured headers that include sensitive personal data. Third, we trained a named entity recognition model on out-of-domain data to add statistical capabilities to our prototype. Relative to a baseline based on regular expressions and gazetteers, we improved the F2-score for de-identification from 78% to 85%. Our prototype is a first step toward further research on German medical text de-identification and shows that spelling variant detection and statistical models trained on out-of-domain data can improve de-identification performance significantly.
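The first two steps of the approach, together with the F2 metric, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the date pattern, the function names (`levenshtein`, `find_variants`, `f_beta`), the tokenization, and the distance threshold `max_dist=1` are all assumptions chosen for illustration. The sketch only shows how names taken from a semi-structured header could be matched against misspelled variants in the free text, and why F2 favors recall.

```python
import re

# Hypothetical rule for step 1: German-style dates such as 03.11.1957.
DATE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{2,4}\b")


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_variants(header_names, body, max_dist=1):
    """Step 2 (assumed form): body tokens within max_dist edits of a header name."""
    tokens = re.findall(r"\w+", body)
    names = [n.lower() for n in header_names]
    return [t for t in tokens
            if any(levenshtein(t.lower(), n) <= max_dist for n in names)]


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented example text; the header name "Schmidt" is misspelled in the body.
body = "Herr Schmitt wurde am 03.11.1957 geboren"
print(DATE_RE.findall(body))
print(find_variants(["Schmidt"], body))
```

A fixed threshold is a blunt choice: a small `max_dist` misses heavier misspellings, while a large one flags ordinary words that resemble a name. In de-identification a missed identifier is usually costlier than a false alarm, which is consistent with reporting F2 rather than F1.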

Stefan Riezler | Christoph Dieterich | Phillip Richter-Pechanski
