Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

This work presents biomedical and clinical language models for Spanish, developed by experimenting with different pretraining choices, such as masking at the word and subword level, varying the vocabulary size, and testing with domain data, in search of better language representations. Notably, in the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model suitable for real-world clinical data. We evaluated our models on Named Entity Recognition (NER) tasks for biomedical documents and challenging hospital discharge reports. Compared with the competitive mBERT and BETO models, ours outperform them on all NER tasks by a significant margin. Finally, we studied the impact of the model’s vocabulary on NER performance through a vocabulary-centric analysis. The results confirm that domain-specific pretraining is fundamental to achieving higher performance in downstream NER tasks, even within a mid-resource scenario. To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, with the aim of boosting native Spanish NLP applications in biomedicine. Our models will be made freely available after publication.
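
To make the word-level versus subword-level masking choice concrete, below is a minimal sketch, not the authors' training pipeline, that contrasts the two masking strategies using the Hugging Face `transformers` data collators. The BETO checkpoint identifier and the Spanish clinical sentence are illustrative assumptions; the actual pretraining corpus, tokenizer, and hyperparameters are those described in the paper.

```python
# Minimal sketch (assumes `transformers` and `torch` are installed):
# subword-level masking vs. whole-word masking for masked language modeling.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorForWholeWordMask,
)

# Publicly released Spanish BERT (BETO); used here only for illustration.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

# Illustrative clinical-style sentence with multi-piece medical terms.
sentence = "El paciente presenta insuficiencia cardíaca congestiva."
encoding = tokenizer(sentence, return_special_tokens_mask=True)

# Subword-level masking: each WordPiece token is masked independently,
# so a long domain term may be only partially masked.
subword_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Whole-word masking: if any piece of a word is selected, all of its
# subword pieces are masked together, forcing the model to predict the
# full term from context.
whole_word_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

for name, collator in [("subword", subword_collator), ("whole-word", whole_word_collator)]:
    batch = collator([encoding])
    masked_tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist())
    print(f"{name:10s}:", " ".join(masked_tokens))
```

Whole-word masking tends to matter most when domain terms split into many subword pieces, which is one reason the masking level is a natural pretraining variant to compare for biomedical and clinical Spanish.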
