Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers

ChatGPT is a large language model developed by OpenAI. Despite its impressive performance across a wide range of tasks, its capability in the biomedical domain has not yet been investigated. To this end, this paper evaluates ChatGPT on a range of benchmark biomedical tasks: relation extraction, document classification, question answering, and summarization. To the best of our knowledge, this is the first extensive evaluation of ChatGPT in the biomedical domain. Interestingly, we find that on biomedical datasets with smaller training sets, zero-shot ChatGPT outperforms even state-of-the-art fine-tuned generative transformer models such as BioGPT and BioBART. This suggests that ChatGPT's pre-training on large text corpora makes it effective even in specialized domains such as biomedicine. Our findings indicate that ChatGPT can be a valuable tool for biomedical tasks that lack large annotated datasets.
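The zero-shot setup described above (no labeled examples, only a task instruction and the input text) can be sketched as follows. The prompt templates, task names, and model identifier here are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of zero-shot prompting for biomedical tasks (illustrative only:
# prompt wording and model name are assumptions, not the paper's exact setup).

def build_zero_shot_prompt(task: str, text: str) -> str:
    """Build a zero-shot prompt: a task instruction plus the input, with no labeled examples."""
    templates = {
        "document_classification": (
            "Classify the following biomedical abstract by the hallmarks of cancer "
            "it discusses. Answer with the hallmark names only.\n\nAbstract: {text}"
        ),
        "question_answering": (
            "Answer the following biomedical research question with yes, no, or maybe.\n\n"
            "Question: {text}"
        ),
        "summarization": (
            "Summarize the following consumer health question in one sentence.\n\n"
            "Question: {text}"
        ),
    }
    return templates[task].format(text=text)

prompt = build_zero_shot_prompt(
    "question_answering", "Do statins reduce the risk of stroke?"
)
# The prompt would then be sent to the chat model, e.g. via the OpenAI API
# (hypothetical call shown as a comment):
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )
print(prompt)
```

Because no task-specific fine-tuning is involved, the same template-based setup extends to any of the evaluated tasks by swapping the instruction.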
