Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT

The availability of biomedical text data and advances in natural language processing (NLP) have made new applications in biomedical NLP possible. Language models trained or fine-tuned on domain-specific corpora can outperform general-purpose models, but work to date in biomedical NLP has been limited in terms of corpora and tasks. We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that BioALBERT outperforms the state of the art on named-entity recognition (+11.09% BLURB score), relation extraction (+0.80% BLURB score), sentence similarity (+1.05% BLURB score), document classification (+0.62% F1-score) and question answering (+2.83% BLURB score), and sets a new state of the art on 17 out of 20 benchmark datasets. By making the BioALBERT models and data available, we aim to help the biomedical NLP community avoid the computational cost of training and to establish a new set of baselines for future efforts across a broad range of biomedical NLP tasks.

Background & Summary

The growing volume of published biomedical literature, such as clinical reports1 and health literacy resources2, demands more precise and generalizable biomedical natural language processing (BioNLP) tools for information extraction. Recent advances in applying deep learning (DL) to natural language processing (NLP) have fuelled the development of pre-trained language models (LMs) that can be applied to a range of tasks in the BioNLP domain3. However, directly fine-tuning state-of-the-art (SOTA) pre-trained LMs such as Embeddings from Language Models (ELMo)4, Bidirectional Encoder Representations from Transformers (BERT)5 and A Lite Bidirectional Encoder Representations from Transformers (ALBERT)6 on BioNLP tasks yields poor performance, because these LMs were trained on general-domain corpora (e.g. Wikipedia and BookCorpus) and were not designed for biomedical documents, which have different word distributions and complex entity relationships7. To overcome this limitation, BioNLP researchers have trained LMs on biomedical and clinical corpora and demonstrated their effectiveness on various downstream BioNLP tasks8–15.

Jin et al.9 trained a biomedical ELMo (BioELMo) on PubMed abstracts and found that the features extracted by BioELMo contained entity-type and relational information relevant to the biomedical corpus. Beltagy et al.11 trained BERT on scientific text and published the trained model as Scientific BERT (SciBERT). Similarly, Si et al.10 used task-specific models and enhanced traditional non-contextual and contextual word-embedding methods for biomedical named-entity recognition (NER) by training BERT on clinical-notes corpora. Peng et al.12 presented the BLUE (Biomedical Language Understanding Evaluation) benchmark, which covers 5 tasks over 10 datasets for analysing biomedical LMs, and showed that BERT models pre-trained on PubMed abstracts and clinical notes outperformed models trained on general corpora. The most popular biomedical pre-trained LM is BioBERT (BERT for Biomedical Text Mining)13, which was trained on the PubMed and PubMed Central (PMC) corpora and fine-tuned on 3 BioNLP tasks: NER, relation extraction (RE) and question answering (QA).
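These domain-specific models share a common recipe: take a general-purpose transformer LM and continue (or restart) pre-training on biomedical text with a masked-language-modelling objective. The sketch below illustrates that recipe for an ALBERT model using the Hugging Face transformers and datasets libraries. The corpus file, checkpoint and hyperparameters are illustrative assumptions, and the sketch omits ALBERT's sentence-order-prediction objective and the domain-specific vocabularies used by some of the models discussed here, so it should be read as a simplified outline rather than any author's exact pipeline.

```python
# A minimal sketch (not the authors' pipeline) of domain-adaptive pre-training:
# continuing ALBERT's masked-language-modelling objective on biomedical sentences.
# Corpus path, checkpoint and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AlbertTokenizerFast, AlbertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")

# One sentence per line, e.g. pre-processed PubMed/PMC abstracts (placeholder file).
corpus = load_dataset("text", data_files={"train": "pubmed_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model is trained to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="albert-biomedical-mlm",
                           per_device_train_batch_size=32,
                           num_train_epochs=1,
                           learning_rate=1e-4),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```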
Gu et al.14 developed PubMedBERT by training from scratch on PubMed articles and showed performance gains over models trained on general corpora; they also built a domain-specific vocabulary from PubMed articles and demonstrated a boost in performance on domain-specific tasks. Another biomedical pre-trained LM is KeBioLM15, which leverages knowledge from the UMLS (Unified Medical Language System) knowledge bases and was applied to 2 BioNLP tasks. Table 1 summarises the datasets previously used to evaluate pre-trained LMs on various BioNLP tasks. Our previous preliminary work has shown the potential of a customised domain-specific LM to outperform the SOTA on NER tasks16. Because all of these pre-trained LMs adopt the BERT architecture, their training is slow and requires substantial computational resources. Further, these LMs were each demonstrated only on selected BioNLP tasks, so their generalizability is unproven.

Table 1. Comparison of the biomedical datasets used in prior language model pre-training studies and in ours (BioALBERT), a biomedical version of the ALBERT language model (✓ = used, × = not used).

Dataset                 BioBERT13  SciBERT11  BLUE12  PubMedBERT14  KeBioLM15  BioALBERT
Share/Clefe17           ×          ×          ✓       ×             ×          ✓
BC5CDR (Disease)18      ✓          ✓          ✓       ✓             ✓          ✓
BC5CDR (Chemical)18     ✓          ✓          ✓       ✓             ✓          ✓
JNLPBA19                ✓          ×          ×       ✓             ✓          ✓
LINNAEUS20              ✓          ×          ×       ×             ×          ✓
NCBI (Disease)21        ✓          ✓          ×       ✓             ✓          ✓
Species-800 (S800)22    ✓          ×          ×       ×             ×          ✓
BC2GM23                 ✓          ×          ×       ✓             ✓          ✓
DDI24                   ×          ×          ✓       ✓             ✓          ✓
ChemProt7               ✓          ✓          ✓       ✓             ✓          ✓
i2b225                  ×          ×          ✓       ×             ×          ✓
EU-ADR26                ✓          ×          ×       ×             ×          ✓
GAD27                   ✓          ×          ×       ✓             ✓          ✓
BIOSSES28               ×          ×          ✓       ✓             ×          ✓
MedSTS29                ×          ×          ✓       ×             ×          ✓
MedNLI30                ×          ×          ✓       ×             ×          ✓
HoC31                   ×          ×          ✓       ✓             ×          ✓
BioASQ 4b32             ✓          ×          ×       ✓             ×          ✓
BioASQ 5b32             ✓          ×          ×       ✓             ×          ✓
BioASQ 6b32             ✓          ×          ×       ✓             ×          ✓

Furthermore, these LMs were trained on limited domain-specific corpora, whereas some tasks involve both clinical and biomedical terms, so training with broader coverage of domain-specific corpora may improve performance. ALBERT has been shown to outperform BERT on general NLP tasks6, and we suggest that, as was shown with BERT, it can be trained to improve BioNLP tasks. In this study, we hypothesize that training ALBERT on biomedical (PubMed and PMC) and clinical-notes (MIMIC-III) corpora can be more effective and computationally efficient for BioNLP tasks than other SOTA methods. We present biomedical ALBERT (BioALBERT), a new LM designed and optimized to benchmark performance on a range of BioNLP tasks. BioALBERT is based on ALBERT and trained on a large corpus of biomedical and clinical texts. We fine-tuned BioALBERT and compared its performance on 6 BioNLP tasks across 20 biomedical and clinical benchmark datasets of varying size and complexity. Whereas most existing BioNLP LMs have focused on a limited set of tasks, BioALBERT achieved SOTA performance on 5 out of 6 BioNLP tasks, covering 17 out of 20 tested datasets. BioALBERT achieved higher performance in NER, RE, sentence similarity and document classification, and a higher accuracy (lenient) score in QA, than the current SOTA LMs. To facilitate developments in the BioNLP community, we make the pre-trained BioALBERT LMs and the source code for fine-tuning BioALBERT publicly available1.
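To make the fine-tuning setup concrete, the following minimal sketch shows how an ALBERT-style checkpoint could be adapted to one of the benchmark tasks, biomedical NER, framed as token classification with the Hugging Face transformers library. The checkpoint name, BIO tag set and toy example are assumptions for illustration; in practice the released BioALBERT weights and a full BIO-tagged corpus such as NCBI-Disease would be used, with proper batching and evaluation.

```python
# A minimal sketch (assumptions noted in comments) of fine-tuning an ALBERT-style
# checkpoint for biomedical NER as token classification.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Disease", "I-Disease"]        # assumed BIO tag set
checkpoint = "albert-base-v2"                   # placeholder for a BioALBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One toy training example; a real run iterates over a BIO-tagged corpus
# such as NCBI-Disease with batching, shuffling and evaluation.
words = ["Patients", "with", "breast", "cancer", "were", "enrolled", "."]
tags  = ["O", "O", "B-Disease", "I-Disease", "O", "O", "O"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)

# Propagate each word's tag to all of its sub-word tokens; special tokens get -100
# so they are ignored by the loss.
word_ids = enc.word_ids(batch_index=0)
label_ids = [labels.index(tags[w]) if w is not None else -100 for w in word_ids]

loss = model(**enc, labels=torch.tensor([label_ids])).loss
loss.backward()
optimizer.step()                                # one gradient step of fine-tuning
print(float(loss))
```

The other benchmark tasks follow the same pattern with a different task head, e.g. a sequence-classification head over sentences or sentence pairs for RE, document classification and sentence similarity, and a span-prediction head for QA.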

[1] Zhiyong Lu, et al. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets, 2019, BioNLP@ACL.

[2] Jingqi Wang, et al. Enhancing Clinical Concept Extraction with Contextual Embedding, 2019, J. Am. Medical Informatics Assoc.

[3] John F. Hurdle, et al. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research, 2008, Yearbook of Medical Informatics.

[4] Georgios Balikas, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, BMC Bioinformatics.

[5] Zhiyong Lu, et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction, 2016, Database J. Biol. Databases Curation.

[6] Laura Inés Furlong, et al. The EU-ADR corpus: Annotated drugs, diseases, targets, and their relationships, 2012, J. Biomed. Informatics.

[7] Tapio Salakoski, et al. Distributional Semantics Resources for Biomedical Text Processing, 2013.

[8] Arzucan Özgür, et al. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain, 2017, Bioinform.

[9] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[10] Anna Korhonen, et al. Automatic semantic classification of scientific literature according to the hallmarks of cancer, 2016, Bioinform.

[11] Rie Kubota Ando, et al. BioCreative II Gene Mention Tagging System at IBM Watson, 2007.

[12] Jaewoo Kang, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining, 2019, Bioinform.

[13] Tomohide Shibata. Understand it in 5 minutes!? A quick look at famous papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.

[14] Anália Lourenço, et al. Overview of the BioCreative VI chemical-protein interaction Track, 2017.

[15] Joyce Y. Chai, et al. Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches, 2019.

[16] Fei Huang, et al. Improving Biomedical Pretrained Language Models with Knowledge, 2021, BIONLP.

[17] Xiaodong Liu, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, 2020, ACM Trans. Comput. Heal.

[18] Nigel Collier, et al. Introduction to the Bio-entity Recognition Task at JNLPBA, 2004, NLPBA/BioNLP.

[19] Goran Nenadic, et al. LINNAEUS: A species name identification system for biomedical literature, 2010, BMC Bioinformatics.

[20] Shuying Shen, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, 2011, J. Am. Medical Informatics Assoc.

[21] Alexey Romanov, et al. Lessons from Natural Language Inference in the Clinical Domain, 2018, EMNLP.

[22] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[23] Matloob Khushi, et al. BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition, 2020, 2021 International Joint Conference on Neural Networks (IJCNN).

[24] Iz Beltagy, et al. SciBERT: A Pretrained Language Model for Scientific Text, 2019, EMNLP.

[25] Paloma Martínez, et al. The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions, 2013, J. Biomed. Informatics.

[26] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.

[27] William W. Cohen, et al. Probing Biomedical Embeddings from Language Models, 2019, Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP.

[28] Núria Queralt-Rosinach, et al. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research, 2014, BMC Bioinformatics.

[29] Gary D. Bader, et al. Transfer learning for biomedical named entity recognition with neural networks, 2018, bioRxiv.

[30] Lena Mårtensson, et al. Health literacy -- a heterogeneous phenomenon: a literature review, 2012, Scandinavian Journal of Caring Sciences.

[31] L. Jensen, et al. The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text, 2013, PLoS ONE.

[32] Hongfang Liu, et al. MedSTS: a resource for clinical semantic textual similarity, 2018, Language Resources and Evaluation.