IXA-NER-RE at eHealth-KD Challenge 2020

The eHealth-KD 2020 challenge set out this year an automatic extraction task covering a broad range of knowledge from health documents written in Spanish. Our group participated in all the proposed scenarios: the main scenario, the Named Entity Recognition (NER) subtask, the Relation Extraction (RE) subtask, and the alternative-domain scenario, obtaining very different results in each of them. The main task was conceived as a pipeline of the NER and RE subtasks, each developed independently of the other. The Named Entity Recognition subtask was envisaged as a basic seq2seq system applying a general-purpose language model together with static embeddings. Unlike the NER subtask, in the RE subtask several approaches were successfully explored: first, transfer learning methods, as a way to measure how well pre-trained language models adapt to both the medical domain and the Spanish language; second, Matching the Blanks, to tackle the reduced size of the training corpus by producing relation representations directly from untagged text. As mentioned, the results across the different tasks were heterogeneous: while the NER result is average (F1 = 0.66), with ample room for improvement, the RE result was outstanding, obtaining first place in that task (F1 = 0.633) with more than three points over the next-ranked system, demonstrating the soundness of the proposed techniques.
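
As an illustration of the NER setup, the following is a minimal sketch of a tagger that stacks static word embeddings with a general-purpose language model, assuming the Flair library; the file names, the multilingual base model, and the hyperparameters are placeholders, not the authors' exact configuration.

    from flair.datasets import ColumnCorpus
    from flair.embeddings import StackedEmbeddings, TransformerWordEmbeddings, WordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # Hypothetical CoNLL-style files: one token per line with its BIO tag.
    corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                          train_file="train.conll",
                          dev_file="dev.conll",
                          test_file="test.conll")
    tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

    # Stack static Spanish fastText vectors with a general-purpose
    # (here multilingual, as an assumption) pre-trained language model.
    embeddings = StackedEmbeddings([
        WordEmbeddings("es"),
        TransformerWordEmbeddings("bert-base-multilingual-cased"),
    ])

    tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                            tag_dictionary=tag_dictionary,
                            tag_type="ner", use_crf=True)
    ModelTrainer(tagger, corpus).train("models/ner", max_epochs=50)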
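
The Matching the Blanks idea can likewise be sketched as follows, assuming HuggingFace Transformers; the entity-marker tokens ([E1], [E2], [BLANK]) follow Soares et al. (2019), while the base model and the example sentences are illustrative rather than the authors' setup. During pre-training, relation representations of two sentences that mention the same entity pair (with the mentions sometimes replaced by [BLANK]) are trained to score higher than those of random pairs, so no relation labels are required.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Illustrative entity markers; [BLANK] replaces mentions during pre-training.
    MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]", "[BLANK]"]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")
    model.resize_token_embeddings(len(tokenizer))

    def relation_representation(text: str) -> torch.Tensor:
        """Concatenate the hidden states at the [E1] and [E2] start markers."""
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state.squeeze(0)
        ids = enc["input_ids"].squeeze(0)
        h1 = hidden[(ids == tokenizer.convert_tokens_to_ids("[E1]")).nonzero()[0, 0]]
        h2 = hidden[(ids == tokenizer.convert_tokens_to_ids("[E2]")).nonzero()[0, 0]]
        return torch.cat([h1, h2])

    # Pre-training signal: sentences sharing the same entity pair should score
    # higher (dot product) against each other than against random sentences.
    r1 = relation_representation("[E1] asma [/E1] causa [E2] disnea [/E2] .")
    r2 = relation_representation("la [E1] [BLANK] [/E1] provoca [E2] [BLANK] [/E2] .")
    score = torch.dot(r1, r2)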
