Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study

Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as “soft” KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM-based models in the inductive setting with novel scientific entities.
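
The router idea can be illustrated with a minimal sketch. This is not the method from the paper: it assumes a plain logistic-regression router, two hypothetical per-example features (normalized entity degree and normalized description length), and toy labels marking which model ranked the correct answer higher on held-out data. All names and values below are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_router(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression router with batch gradient descent.

    Label 1 means "send this example to the LM-based model",
    label 0 means "send it to the KG embedding model".
    """
    n = len(features)
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this example
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def route(x, w, b):
    """Return which model type should score the example."""
    score = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return "lm" if score >= 0.5 else "kge"


# Hypothetical features: [normalized entity degree, normalized description length].
# Hypothetical labels: 1 if the LM-based model ranked the true entity higher
# than the KG embedding model on a held-out example, else 0.
features = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7],
            [0.9, 0.2], [0.8, 0.1], [0.85, 0.3]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_router(features, labels)
print(route([0.1, 0.9], w, b))  # sparsely connected entity with a long description
print(route([0.9, 0.1], w, b))  # densely connected entity with a short description
```

In this toy setup the router learns to send sparsely connected, well-described entities to the LM-based model and densely connected ones to the KG embedding model. The features and the classifier used in the actual study may differ.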
