Unsupervised Pre-training for Biomedical Question Answering

We explore the suitability of unsupervised representation learning methods on biomedical text (BioBERT, SciBERT, and BioSentVec) for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task, built from unlabeled data, that is designed to reason about biomedical entities in context. Our pre-training method corrupts a given context by replacing a randomly chosen mention of a biomedical entity with a random entity mention, then queries the model with the correct entity mention so that it must locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that help QA tasks, and, because the model must predict spans, it minimizes the train-test mismatch between the pre-training task and the downstream QA tasks. Our experiments show that pre-training BioBERT on the proposed task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.
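The corruption step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Example` and `corrupt_context`, the character-offset representation of mentions, and the uniform sampling of the replacement are all assumptions made for illustration, and entity mentions are assumed to have been annotated beforehand by some tagger.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    query: str         # the original (correct) entity mention
    context: str       # the context after corruption
    answer_start: int  # character offset where the corrupted span begins
    answer_end: int    # exclusive character offset where it ends


def corrupt_context(context, mentions, vocabulary, rng):
    """Replace one entity mention with a different, randomly chosen mention.

    mentions:   list of (start, end) character offsets of entity mentions
                in `context` (assumed to be given, e.g. by an entity tagger).
    vocabulary: pool of entity mention strings to draw the replacement from
                (hypothetical; the source does not specify the pool).
    """
    start, end = rng.choice(mentions)
    original = context[start:end]
    # Exclude the original mention so the context is really corrupted.
    candidates = [m for m in vocabulary if m != original]
    replacement = rng.choice(candidates)
    corrupted = context[:start] + replacement + context[end:]
    # The model is queried with `original` and must predict the span
    # that now holds `replacement`.
    return Example(original, corrupted, start, start + len(replacement))


context = "Aspirin inhibits cyclooxygenase and reduces fever."
entity_strings = ["Aspirin", "cyclooxygenase"]
mentions = [(context.index(m), context.index(m) + len(m)) for m in entity_strings]
vocabulary = ["Aspirin", "cyclooxygenase", "ibuprofen", "insulin", "BRCA1"]

example = corrupt_context(context, mentions, vocabulary, random.Random(0))
print(example.query)
print(example.context)
print(example.context[example.answer_start:example.answer_end])
```

Each generated example has the same shape as an extractive QA instance (a query, a context, and an answer span), which is the property the abstract credits with reducing the mismatch between pre-training and downstream QA.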

Kun Li | Trapit Bansal | Andrew McCallum | Ivana Williams | Vaishnavi Kommaraju | Karthick Gunasekaran | Ana-Maria Istrate
