Paper information - Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3,000 Spanish domains carried out in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been used to train domain-specific language models and to produce word embeddings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license on Zenodo (https://doi.org/10.5281/zenodo.4561970).
