An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain

With the growing volume of text in health data, there have been rapid advances in large pre-trained language models that can be applied to a wide variety of biomedical tasks with minimal task-specific modifications. Because the high cost of these models makes technical replication challenging, this paper summarizes experiments on replicating BioBERT, further pre-training, and careful fine-tuning in the biomedical domain. We also investigate the effectiveness of domain-specific versus domain-agnostic pre-trained models across downstream biomedical NLP tasks. Our findings confirm that pre-trained models can be impactful on some downstream NLP tasks (QA and NER) in the biomedical domain; however, this improvement may not justify the high cost of domain-specific pre-training.

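As an illustration of the fine-tuning setup described above, the following minimal sketch fine-tunes a pre-trained transformer encoder with a token-classification head on a toy biomedical NER example using the Hugging Face Transformers API. The checkpoint identifier, label set, and example sentence are assumptions for illustration only, not the paper's exact configuration; a domain-agnostic model (e.g., bert-base-cased) can be swapped in to compare against the domain-specific checkpoint.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-v1.1"      # assumed BioBERT checkpoint id; any BERT-style model works
labels = ["O", "B-Disease", "I-Disease"]  # assumed NCBI-disease-style tag set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# One toy sentence with word-level disease labels; each word's label is assigned to its
# first sub-token, while continuation and special sub-tokens get -100 so the
# cross-entropy loss ignores them.
words = ["Mutations", "cause", "cystic", "fibrosis", "."]
word_labels = [0, 0, 1, 2, 0]  # O O B-Disease I-Disease O

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
aligned, prev = [], None
for wid in enc.word_ids(batch_index=0):
    aligned.append(-100 if wid is None or wid == prev else word_labels[wid])
    prev = wid

# A single fine-tuning step with a typical learning rate (2e-5 to 5e-5).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
outputs = model(**enc, labels=torch.tensor([aligned]))
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))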