Biomedical relation extraction with pre-trained language representations and minimal task-specific architecture

This paper presents our participation in the AGAC Track of the 2019 BioNLP Open Shared Tasks. We provide a solution for Task 3, which aims to extract "gene - function change - disease" triples, where "gene" and "disease" are mentions of particular genes and diseases respectively and "function change" is one of four pre-defined relationship types. Our system extends BERT (Devlin et al., 2019), a state-of-the-art language model that learns contextual language representations from a large unlabelled corpus and whose parameters can be fine-tuned for specific tasks with minimal additional architecture. We encode each pair of mentions and their textual context as two consecutive sequences in BERT, separated by a special symbol, and use a single linear layer to classify their relationship into five classes (the four pre-defined types plus 'no relation'). Despite considerable class imbalance, our system significantly outperforms a random baseline while relying on an extremely simple setup with no specially engineered features.
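
To make the described setup concrete, the sketch below shows one way to wire it up with the Hugging Face transformers library. It illustrates the architecture described in the abstract rather than reproducing the authors' code: the checkpoint name, the label strings, and the formatting of the mention pair are all assumptions made for the example.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Four pre-defined function-change classes plus 'no relation'; these
    # label strings are illustrative placeholders, not official AGAC tags.
    LABELS = ["LOF", "GOF", "REG", "COM", "NO_RELATION"]

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    # BertForSequenceClassification puts a single linear classification
    # layer on top of BERT's pooled [CLS] representation.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=len(LABELS)
    )
    model.eval()

    # Segment A: the gene and disease mentions; segment B: their textual
    # context. The tokenizer joins the two segments with BERT's special
    # [SEP] symbol and assigns them distinct segment embeddings.
    mentions = "BRCA1 ; breast cancer"
    context = "Germline mutations in BRCA1 predispose carriers to breast cancer."

    inputs = tokenizer(mentions, context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 5): one score per class
    print(LABELS[logits.argmax(dim=-1).item()])

In practice the classification head would first be fine-tuned on the AGAC training triples; as initialised above, the snippet only demonstrates the two-segment input encoding and the five-way output.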

[1] Andrew McCallum et al. Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction, 2018, NAACL.

[2] Guodong Zhou et al. Chemical-induced disease relation extraction with various linguistic features, 2016, Database: The Journal of Biological Databases and Curation.

[3] Yijia Zhang et al. A hybrid model based on neural networks for biomedical relation extraction, 2018, Journal of Biomedical Informatics.

[4] Sung-Pil Choi et al. Extraction of protein–protein interactions (PPIs) from the literature by deep convolutional neural networks with various feature embeddings, 2018, Journal of Information Science.

[5] Jaewoo Kang et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining, 2019, Bioinformatics.

[6] Karin M. Verspoor et al. Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings, 2018, BioNLP.

[7] Fei Li et al. A neural joint model for entity and relation extraction from biomedical text, 2017, BMC Bioinformatics.

[8] Iz Beltagy et al. SciBERT: A Pretrained Language Model for Scientific Text, 2019, EMNLP.

[9] Yifan Peng et al. Improving chemical disease relation extraction with rich features and weakly labeled data, 2016, Journal of Cheminformatics.

[10] Pushpak Bhattacharyya et al. Relation Extraction: A Survey, 2017, ArXiv.

[11] Jeffrey Ling et al. Matching the Blanks: Distributional Similarity for Relation Learning, 2019, ACL.

[12] Yifan Peng et al. Chemical-protein relation extraction with ensembles of SVM, CNN, and RNN models, 2018, ArXiv.

[13] George Kurian et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[14] Wilson L. Taylor. “Cloze Procedure”: A New Tool for Measuring Readability, 1953.

[15] Philippe Cudré-Mauroux et al. Relation Extraction Using Distant Supervision, 2018, ACM Computing Surveys.

[16] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[17] Omer Levy et al. What Does BERT Look at? An Analysis of BERT’s Attention, 2019, BlackboxNLP@ACL.

[18] Alec Radford et al. Improving Language Understanding by Generative Pre-Training, 2018.

[19] Lukasz Kaiser et al. Attention Is All You Need, 2017, NIPS.

[20] Anália Lourenço et al. Overview of the BioCreative VI chemical-protein interaction Track, 2017.

[21] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[22] Leonhard Hennig et al. Improving Relation Extraction by Pre-trained Language Representations, 2019, AKBC.

[23] Zhiyong Lu et al. PubTator: a web-based text mining tool for assisting biocuration, 2013, Nucleic Acids Research.

[24] Zhiyong Lu et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction, 2016, Database: The Journal of Biological Databases and Curation.

[25] Yoav Goldberg et al. Assessing BERT's Syntactic Abilities, 2019, ArXiv.