NLNDE: The Neither-Language-Nor-Domain-Experts' Way of Spanish Medical Document De-Identification

Natural language processing has huge potential in the medical domain which recently led to a lot of research in this field. However, a prerequisite of secure processing of medical documents, e.g., patient notes and clinical trials, is the proper de-identification of privacy-sensitive information. In this paper, we describe our NLNDE system, with which we participated in the MEDDOCAN competition, the medical document anonymization task of IberLEF 2019. We address the task of detecting and classifying protected health information from Spanish data as a sequence-labeling problem and investigate different embedding methods for our neural network. Despite dealing in a non-standard language and domain setting, the NLNDE system achieves promising results in the competition.

[1]  Felipe Soares,et al.  Medical Word Embeddings for Spanish: Development and Evaluation , 2019, Proceedings of the 2nd Clinical Natural Language Processing Workshop.

[2]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[3]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[4]  Xiaolong Wang,et al.  De-identification of clinical notes via recurrent neural network and conditional random field. , 2017, Journal of biomedical informatics.

[5]  Özlem Uzuner,et al.  Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus , 2015, J. Biomed. Informatics.

[6]  Aitor Gonzalez-Agirre,et al.  Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results , 2019, IberLEF@SEPLN.

[7]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[8]  Guillaume Lample,et al.  Neural Architectures for Named Entity Recognition , 2016, NAACL.

[9]  Heike Adel,et al.  Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging , 2019, NAACL-HLT.

[10]  Prakhar Gupta,et al.  Learning Word Vectors for 157 Languages , 2018, LREC.

[11]  Roland Vollgraf,et al.  Pooled Contextualized Embeddings for Named Entity Recognition , 2019, NAACL.

[12]  Rema Padman,et al.  A Deep Learning Architecture for De-identification of Patient Notes: Implementation and Evaluation , 2018, ArXiv.

[13]  Jonathan M. Garibaldi,et al.  Automatic detection of protected health information from clinic narratives , 2015, J. Biomed. Informatics.

[14]  Roland Vollgraf,et al.  Contextual String Embeddings for Sequence Labeling , 2018, COLING.