Self-alignment Pre-training for Biomedical Entity Representations

Despite the widespread success of self-supervised learning via masked language models, learning representations directly from text that accurately capture complex, fine-grained semantic relationships in the biomedical domain remains a challenge. Addressing this is of paramount importance for tasks such as entity linking, where complex relational knowledge is pivotal. We propose SapBERT, a pre-training scheme based on BERT. It self-aligns the representation space of biomedical entities with a metric learning objective that leverages UMLS, a collection of biomedical ontologies with >4M concepts. Our experimental results on six medical entity linking benchmark datasets demonstrate that SapBERT outperforms many domain-specific BERT-based variants such as BioBERT, BlueBERT and PubMedBERT, achieving state-of-the-art (SOTA) performance.
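To make the idea of self-alignment concrete, the sketch below shows one way a metric learning objective over ontology synonyms might look. It is an illustration only: the abstract does not state the exact loss, so a generic supervised contrastive (InfoNCE-style) loss is used as a stand-in, and the function name `self_alignment_loss`, the `temperature` parameter, and the concept identifiers in the usage note are hypothetical. Embeddings are assumed to come from some encoder such as BERT; here they are plain NumPy arrays.

```python
import numpy as np


def self_alignment_loss(embeddings, concept_ids, temperature=0.1):
    """Illustrative contrastive loss (an assumption, not the paper's stated loss).

    Names that share a concept ID are treated as positives and pulled
    together; all other names in the batch act as negatives.
    Assumes the batch has at least two names and at least one synonym pair.
    """
    emb = np.asarray(embeddings, dtype=float)
    ids = np.asarray(concept_ids)
    n = len(ids)

    # L2-normalise so that dot products are cosine similarities.
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    self_mask = np.eye(n, dtype=bool)
    # Positives: same concept ID, excluding the anchor itself.
    pos_mask = (ids[:, None] == ids[None, :]) & ~self_mask

    # Remove self-similarity from the softmax denominator.
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Mean log-probability of the positives, for anchors that have any.
    has_pos = pos_mask.any(axis=1)
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos_mask[has_pos].sum(axis=1)
    return float(per_anchor.mean())
```

As a usage example, a batch of four name embeddings with hypothetical concept IDs `["C0000001", "C0000001", "C0000002", "C0000002"]` gives a low loss when the first two vectors are close to each other and the last two are close to each other, and a high loss when synonyms are far apart. Minimising such a loss over an encoder's parameters is one way to align the representation space with the ontology's synonym structure.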
