SINAI at eHealth-KD Challenge 2020

This paper describes the system submitted by the SINAI research group to the eHealth-KD challenge at IberLEF 2020. The challenge defined two main subtasks for knowledge discovery in Spanish medical records: entity recognition and relationship extraction. In Natural Language Processing (NLP), Named Entity Recognition (NER) can be formulated as a sequence labeling problem in which the text is treated as a sequence of words, each to be assigned a tag. Since current state-of-the-art approaches to sequence labeling typically use Recurrent Neural Networks (RNNs), our proposal employs a BiLSTM+CRF neural network that takes a combination of different word embeddings as input. This allowed us to test the performance of several types of word embeddings on the NER task in Spanish medical records: medical embeddings that we generated ourselves, contextualized non-medical embeddings, and pre-trained non-medical embeddings based on transformers. Our system achieved the highest F1-score among all participants on the entity recognition task, while the results on the relationship extraction task show that the proposed approach needs further exploration.
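As a rough illustration of the two ideas above (NER as per-token tagging, and combining several embeddings into one input vector per token), the following minimal sketch may help. It is not the authors' implementation: the sentence, the entity labels, the BIO tagging scheme, the tiny lookup tables, and the helper names `spans_to_bio` and `stack_embeddings` are all assumptions made for illustration, and simple concatenation stands in for whatever combination method the system actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lookup tables standing in for two embedding sources
# (e.g. a medical one and a general-domain one). Values are made up.
MEDICAL: Dict[str, List[float]] = {"asma": [0.9, 0.1], "afecta": [0.2, 0.3]}
GENERAL: Dict[str, List[float]] = {
    "asma": [0.5, 0.4, 0.1],
    "afecta": [0.1, 0.7, 0.2],
    "el": [0.0, 0.1, 0.0],
}


def embed(token: str, table: Dict[str, List[float]], dim: int) -> List[float]:
    """Look up a token (lowercased); unknown tokens get a zero vector."""
    return table.get(token.lower(), [0.0] * dim)


def stack_embeddings(tokens: List[str]) -> List[List[float]]:
    """Concatenate the vectors from both sources for every token."""
    return [embed(t, MEDICAL, 2) + embed(t, GENERAL, 3) for t in tokens]


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Illustrative sentence and annotations (assumed, not taken from the corpus).
tokens = ["El", "asma", "afecta", "las", "vías", "respiratorias"]
spans = [(1, 2, "Concept"), (2, 3, "Action"), (4, 6, "Concept")]

tags = spans_to_bio(tokens, spans)
features = stack_embeddings(tokens)
for token, tag, vector in zip(tokens, tags, features):
    print(f"{token:15s} {tag:10s} {vector}")
```

In a setup like the one described, the per-token vectors would be fed to the BiLSTM, and the CRF layer would predict one tag per token, with multi-word entities recovered from consecutive `B-`/`I-` tags.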
