Paper Information - Self-alignment Pre-training for Biomedical Entity Representations

Self-alignment Pre-training for Biomedical Entity Representations

Despite the widespread success of self-supervised learning via masked language models, learning representations directly from text that accurately capture complex, fine-grained semantic relationships in the biomedical domain remains a challenge. Addressing this is of paramount importance for tasks such as entity linking, where complex relational knowledge is pivotal. We propose SapBERT, a pre-training scheme based on BERT. It self-aligns the representation space of biomedical entities using a metric learning objective that leverages UMLS, a collection of biomedical ontologies with more than 4M concepts. Our experimental results on six medical entity linking benchmark datasets demonstrate that SapBERT outperforms many domain-specific BERT-based variants such as BioBERT, BlueBERT and PubMedBERT, achieving state-of-the-art (SOTA) performance.
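
To make the idea of self-alignment with a metric learning objective more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the specific loss, so the multi-similarity-style pairwise loss used here is an assumption, as are the hyperparameter values (alpha, beta, lam), the toy 2-dimensional embeddings, and the concept identifiers "C_A" and "C_B", which are hypothetical stand-ins for UMLS concept IDs. The sketch only illustrates the general principle: names that share a concept should have similar embeddings, and names of different concepts should not.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_similarity_loss(embeddings, concept_ids, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-similarity-style loss over a batch of entity-name embeddings.

    Names sharing a concept ID are positives for each other; all other
    names in the batch are negatives. alpha, beta and lam are illustrative
    values chosen for this sketch, not values taken from the abstract.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        pos_sum = 0.0
        neg_sum = 0.0
        for k in range(n):
            if k == i:
                continue
            s = cosine(embeddings[i], embeddings[k])
            if concept_ids[k] == concept_ids[i]:
                # Penalise synonyms whose similarity falls below lam.
                pos_sum += math.exp(-alpha * (s - lam))
            else:
                # Penalise non-synonyms whose similarity rises above lam.
                neg_sum += math.exp(beta * (s - lam))
        total += math.log1p(pos_sum) / alpha + math.log1p(neg_sum) / beta
    return total / n


# Hypothetical concept IDs: two names per concept.
concept_ids = ["C_A", "C_A", "C_B", "C_B"]

# Toy embeddings where synonyms already point in similar directions.
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Toy embeddings where synonyms are far apart and non-synonyms are close.
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

aligned_loss = multi_similarity_loss(aligned, concept_ids)
misaligned_loss = multi_similarity_loss(misaligned, concept_ids)
print(f"aligned: {aligned_loss:.4f}  misaligned: {misaligned_loss:.4f}")
```

In this toy setting the loss is lower when synonymous names are close together in the embedding space, which is the behaviour a self-alignment objective would push an encoder towards. In a real pipeline the vectors would come from a BERT-style encoder and the positive pairs from synonym sets in an ontology such as UMLS.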

Marco Basaldella | Zaiqiao Meng | Fangyu Liu | Nigel Collier | Ehsan Shareghi
