A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks

Recently, Large Language Models (LLMs) have demonstrated impressive capability to solve a wide range of tasks. However, despite their success across various tasks, no prior work has investigated their capability in the biomedical domain yet. To this end, this paper aims to evaluate the performance of LLMs on benchmark biomedical tasks. For this purpose, we conduct a comprehensive evaluation of 4 popular LLMs on 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work that conducts an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, we find that on biomedical datasets with smaller training sets, zero-shot LLMs even outperform the current state-of-the-art fine-tuned biomedical models. This suggests that pretraining on large text corpora makes LLMs quite capable even in the biomedical domain. We also find that no single LLM outperforms all others across every task, as the performance of different LLMs varies depending on the task. While their performance is still quite poor in comparison to biomedical models fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for biomedical tasks that lack large annotated data.
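
The evaluation described above relies on zero-shot prompting, i.e., querying each LLM with an instruction alone, without any labeled examples from the target dataset. Below is a minimal sketch of what such a zero-shot setup might look like for a disease named entity recognition dataset such as NCBI-Disease; the build_zero_shot_prompt and query_llm names are illustrative assumptions, not the paper's released code, and the exact prompt wording used in the study may differ.

    # Minimal sketch of zero-shot prompting for a biomedical NER dataset (e.g., NCBI-Disease).
    # `query_llm` is a hypothetical callable wrapping whichever LLM API is under evaluation;
    # it is an assumption for illustration, not the paper's actual implementation.

    def build_zero_shot_prompt(sentence: str) -> str:
        """Construct an instruction-only prompt: no labeled examples are included."""
        return (
            "Task: Identify all disease mentions in the sentence below.\n"
            "Return the mentions as a comma-separated list, or 'None' if there are none.\n\n"
            f"Sentence: {sentence}\n"
            "Diseases:"
        )

    def zero_shot_ner(sentences, query_llm):
        """Run zero-shot extraction over a dataset split and collect predictions."""
        predictions = []
        for sentence in sentences:
            prompt = build_zero_shot_prompt(sentence)
            raw_answer = query_llm(prompt)  # one API call per test instance
            mentions = [m.strip() for m in raw_answer.split(",")
                        if m.strip() and m.strip().lower() != "none"]
            predictions.append(mentions)
        return predictions

Predictions collected this way can then be scored against the gold annotations using the dataset's standard metric (e.g., entity-level F1) and compared with fine-tuned biomedical baselines.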
