LSI2_UNED at eHealth-KD Challenge 2019: A Few-shot Learning Model for Knowledge Discovery from eHealth Documents

In this work, we describe a Few-Shot Learning approach for Named Entity Recognition (NER) in eHealth documents to identify and classify key phrases in a document (subtask A in the IberLEF eHealth-KD 2019 competition [10]). The architecture is a hybrid Bi-LSTM and CNN model with four input layers that can recognize multi-word entities using the BIO encoding format for the labels. The system obtained an F-score of 73.15% (the baseline is 54.66%), with a precision of 78.17%, according to the eHealth-KD evaluation procedure. This improvement is due mainly to (a) the selection of a hybrid NER model, which obtains better results when using a POS tagger, and (b) the addition of Wikidata entities to extend the vocabulary, which improves precision by nearly 10%.
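As an illustration of the BIO encoding mentioned above, the following minimal sketch shows how multi-word key phrases can be turned into per-token labels and recovered again. It is not the system's implementation: the function names `bio_encode` and `bio_decode`, the example sentence, the token-index span format, and the label names used here are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def bio_encode(tokens: List[str], spans: List[Span]) -> List[str]:
    """Tag the first token of each span B-<label>, the rest I-<label>, others O."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_decode(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Rebuild (phrase, label) pairs from a BIO tag sequence."""
    phrases: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is open.
            if current:
                phrases.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close any open entity.
            if current:
                phrases.append((" ".join(current), label))
            current, label = [], ""
    if current:
        phrases.append((" ".join(current), label))
    return phrases


# Illustrative sentence and annotations (assumed, not taken from the corpus).
tokens = ["El", "asma", "afecta", "las", "vías", "respiratorias"]
spans = [(1, 2, "Concept"), (2, 3, "Action"), (4, 6, "Concept")]

tags = bio_encode(tokens, spans)
phrases = bio_decode(tokens, tags)
```

Here the two-token phrase "vías respiratorias" receives a `B-` tag followed by an `I-` tag, which is what lets a token-level classifier represent a multi-word entity as a single unit.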

[1] L. F. Rau, et al. Extracting company names from text, 1991, Proceedings. The Seventh IEEE Conference on Artificial Intelligence Application.

[2] Mihai Surdeanu, et al. The Stanford CoreNLP Natural Language Processing Toolkit, 2014, ACL.

[3] Erik Cambria, et al. Recent Trends in Deep Learning Based Natural Language Processing, 2017, IEEE Comput. Intell. Mag.

[4] Sampo Pyysalo, et al. brat: a Web-based Tool for NLP-Assisted Text Annotation, 2012, EACL.

[5] Tianxi Cai, et al. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data, 2018, PSB.

[6] Kuldip K. Paliwal, et al. Bidirectional recurrent neural networks, 1997, IEEE Trans. Signal Process.

[7] Paloma Martínez, et al. A Hybrid Bi-LSTM-CRF model for Knowledge Recognition from eHealth documents, 2018, TASS@SEPLN.

[8] Pietro Perona, et al. One-shot learning of object categories, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] Paloma Martínez, et al. Simplifying drug package leaflets written in Spanish by using word embedding, 2017, Journal of Biomedical Semantics.

[10] Hiroyuki Shindo, et al. Wikipedia2Vec: An Optimized Implementation for Learning Embeddings from Wikipedia, 2018.

[11] Ana M. García-Serrano, et al. HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset, 2017, Inf. Syst.

[12] Rafael Muñoz, et al. Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2019, 2021, IberLEF@SEPLN.

[13] Andrey Kormilitzin, et al. Few-shot Learning for Named Entity Recognition in Medical Text, 2018, ArXiv.

[14] Ana M. García-Serrano, et al. Formal concept analysis for topic detection: A clustering quality experimental analysis, 2017, Inf. Syst.

[15] Juan Martínez-Romo, et al. Deep neural models for extracting entities and relationships in the new RDD corpus relating disabilities and rare diseases, 2018, Comput. Methods Programs Biomed.

[16] Christoph H. Lampert, et al. Zero-Shot Learning: A Comprehensive Evaluation of the Good, the Bad and the Ugly, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17] Eric Nichols, et al. Named Entity Recognition with Bidirectional LSTM-CNNs, 2015, TACL.

[18] Ana M. García-Serrano, et al. Experiences at ImageCLEF 2010 using CBIR and TBIR Mixing Information Approaches, 2010, CLEF.