Unsupervised Pre-training for Biomedical Question Answering

We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task from unlabeled data designed to reason about biomedical entities in the context. Our pre-training method consists of corrupting a given context by randomly replacing some mention of a biomedical entity with a random entity mention and then querying the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that helps QA tasks and minimizes the train-test mismatch between the pre-training task and the downstream QA tasks by requiring the model to predict spans. Our experiments show that pre-training BioBERT on the proposed pre-training task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.

[1]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[2]  Qingyu Chen,et al.  BioWordVec, improving biomedical word embeddings with subword information and MeSH , 2019, Scientific Data.

[3]  Zhiyong Lu,et al.  PubTator: a web-based text mining tool for assisting biocuration , 2013, Nucleic Acids Res..

[4]  Said Ouatik El Alaoui,et al.  A Biomedical Question Answering System in BioASQ 2017 , 2017, BioNLP.

[5]  Richard Socher,et al.  Unifying Question Answering and Text Classification via Span Extraction , 2019, ArXiv.

[6]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[7]  Yiming Yang,et al.  XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[8]  Roy Schwartz,et al.  Knowledge Enhanced Contextual Word Representations , 2019, EMNLP/IJCNLP.

[9]  Richard Socher,et al.  Unifying Question Answering, Text Classification, and Regression via Span Extraction , 2019 .

[10]  Jaewoo Kang,et al.  Pre-trained Language Model for Biomedical Question Answering , 2019, PKDD/ECML Workshops.

[11]  Jian Peng,et al.  emrQA: A Large Corpus for Question Answering on Electronic Medical Records , 2018, EMNLP.

[12]  Manoj Kumar Chinnakotla,et al.  IIITH at BioASQ Challange 2015 Task 3b: Bio-Medical Question Answering System , 2015, CLEF.

[13]  Jian Zhang,et al.  SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[14]  Yan Li,et al.  A generic retrieval system for biomedical literatures: USTB at BioASQ2015 Question Answering Task , 2015, CLEF.

[15]  Mariana L. Neves,et al.  Neural Domain Adaptation for Biomedical Question Answering , 2017, CoNLL.

[16]  Ludovic Denoyer,et al.  Unsupervised Question Answering by Cloze Translation , 2019, ACL.

[17]  Andrew McCallum,et al.  Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets , 2018, EMNLP.

[18]  Maosong Sun,et al.  ERNIE: Enhanced Language Representation with Informative Entities , 2019, ACL.

[19]  Jianjun Hu,et al.  A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets , 2020, Applied Sciences.

[20]  Yifan Peng,et al.  BioSentVec: creating sentence embeddings for biomedical texts , 2018, 2019 IEEE International Conference on Healthcare Informatics (ICHI).

[21]  Iz Beltagy,et al.  SciBERT: A Pretrained Language Model for Scientific Text , 2019, EMNLP.

[22]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[23]  Omer Levy,et al.  SpanBERT: Improving Pre-training by Representing and Predicting Spans , 2019, TACL.

[24]  William W. Cohen,et al.  PubMedQA: A Dataset for Biomedical Research Question Answering , 2019, EMNLP.

[25]  Weiming Zhang,et al.  Neural Machine Reading Comprehension: Methods and Trends , 2019, Applied Sciences.

[26]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[27]  Axel-Cyrille Ngonga Ngomo,et al.  BioASQ: A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering , 2012, AAAI Fall Symposium: Information Retrieval and Knowledge Discovery in Biomedical Text.

[28]  Andrew McCallum,et al.  Simultaneously Linking Entities and Extracting Relations from Biomedical Text Without Mention-level Supervision , 2019, AAAI.

[29]  Jungyun Seo,et al.  KSAnswer: Question-answering System of Kangwon National University and Sogang University in the 2016 BioASQ Challenge , 2016 .

[30]  Wei-Hung Weng,et al.  Publicly Available Clinical BERT Embeddings , 2019, Proceedings of the 2nd Clinical Natural Language Processing Workshop.