Continual knowledge infusion into pre-trained biomedical language models

MOTIVATION Biomedical language models produce meaningful concept representations that are useful for a variety of biomedical natural language processing (bioNLP) applications such as named entity recognition, relationship extraction, and question answering. Recent research has shown that contextualized language models (e.g., BioBERT, BioELMo) possess tremendous representational power and achieve impressive accuracy gains. However, these models are still unable to learn high-quality representations for concepts with little contextual information (i.e., rare words). Infusing complementary information from knowledge bases (KBs) is likely to help when corpus-specific information is insufficient to learn robust representations. Moreover, because the biomedical domain contains numerous KBs, it is imperative to develop approaches that can integrate KBs in a continual fashion.

RESULTS We propose a new representation learning approach that progressively fuses the semantic information from multiple KBs into pretrained biomedical language models. Since most KBs in the biomedical domain are expressed as parent-child hierarchies, we choose to model hierarchical KBs and propose a new knowledge modeling strategy that encodes their topological properties at a granular level. Moreover, the proposed continual learning technique efficiently updates the concept representations to accommodate new knowledge whilst preserving the memory efficiency of contextualized language models. Altogether, the proposed approach generates knowledge-powered embeddings with high fidelity and learning efficiency. Extensive experiments conducted on bioNLP tasks validate the efficacy of the proposed approach and demonstrate its capability in generating robust concept representations.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
