Boosting Transformers for Job Expression Extraction and Classification in a Low-Resource Setting

In this paper, we explore possible improvements of transformer models in a low-resource setting. In particular, we present our approaches to tackle the first two of three subtasks of the MEDDOPROF competition, i.e., the extraction and classification of job expressions in Spanish clinical texts. As neither language nor domain experts, we experiment with the multilingual XLM-R transformer model and tackle these low-resource information extraction tasks as sequence-labeling problems. We explore domain- and language-adaptive pretraining, transfer learning and strategic data splits to boost the transformer model. Our results show strong improvements of up to 5.3 F1 points using these methods compared to a fine-tuned XLM-R model. Our best models achieve 83.2 and 79.3 F1 for the first two tasks, respectively.
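The sequence-labeling framing mentioned above can be illustrated with a minimal sketch: character-level entity annotations are converted into token-level BIO tags, which a tagger such as a fine-tuned XLM-R then predicts per token. The tag name `PROFESION` and the example sentence are hypothetical illustrations, not the actual MEDDOPROF annotation guidelines.

```python
# Minimal sketch: turn character-offset entity spans into token-level BIO labels,
# the standard framing for sequence-labeling extraction tasks.
def spans_to_bio(tokens, spans):
    """tokens: list of (start, end) character offsets of each token.
    spans: list of (start, end, label) entity annotations."""
    labels = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        inside = False  # first token of a span gets B-, the rest I-
        for i, (t_start, t_end) in enumerate(tokens):
            if t_start >= s_start and t_end <= s_end:
                labels[i] = ("I-" if inside else "B-") + label
                inside = True
    return labels

# Hypothetical example: "trabaja como enfermera" with "enfermera" annotated.
tokens = [(0, 7), (8, 12), (13, 22)]   # trabaja / como / enfermera
spans = [(13, 22, "PROFESION")]
print(spans_to_bio(tokens, spans))     # ['O', 'O', 'B-PROFESION']
```

For the classification subtask, the same scheme applies with the entity class encoded in the tag, so both tasks reduce to the same per-token prediction problem.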
