Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes

Background Automated biomedical named entity recognition and normalization serves as the basis for many downstream applications in information management. However, this task is challenging due to name variations and entity ambiguity. A biomedical entity may have multiple variants and a variant could denote several different entity identifiers. Results To remedy the above issues, we present a novel knowledge-enhanced system for protein/gene named entity recognition (PNER) and normalization (PNEN). On one hand, a large amount of entity name knowledge extracted from biomedical knowledge bases is used to recognize more entity variants. On the other hand, structural knowledge of entities is extracted and encoded as identifier (ID) embeddings, which are then used for better entity normalization. Moreover, deep contextualized word representations generated by pre-trained language models are also incorporated into our knowledge-enhanced system for modeling multi-sense information of entities. Experimental results on the BioCreative VI Bio-ID corpus show that our proposed knowledge-enhanced system achieves 0.871  F 1-score for PNER and 0.445  F 1-score for PNEN, respectively, leading to a new state-of-the-art performance. Conclusions We propose a knowledge-enhanced system that combines both entity knowledge and deep contextualized word representations. Comparison results show that entity knowledge is beneficial to the PNER and PNEN task and can be well combined with contextualized information in our system for further improvement.

[1]  Guillaume Lample,et al.  Neural Architectures for Named Entity Recognition , 2016, NAACL.

[2]  Xiaolong Wang,et al.  CNN-based ranking for biomedical entity normalization , 2017, BMC Bioinformatics.

[3]  Hua Xu,et al.  Recognizing and Encoding Discorder Concepts in Clinical Text using Machine Learning and Vector Space Model , 2013, CLEF.

[4]  Po-Ting Lai,et al.  NERChem: adapting NERBio to chemical patents via full-token features and named entity feature with chemical sub-class composition , 2016, Database J. Biol. Databases Curation.

[5]  Quoc V. Le,et al.  Semi-Supervised Sequence Modeling with Cross-View Training , 2018, EMNLP.

[6]  Zhiyong Lu,et al.  tmChem: a high performance approach for chemical named entity recognition and normalization , 2015, Journal of Cheminformatics.

[7]  Hongfei Lin,et al.  An attention‐based BiLSTM‐CRF approach to document‐level chemical named entity recognition , 2018, Bioinform..

[8]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[9]  Han Zhao,et al.  Self-Adaptive Hierarchical Sentence Model , 2015, IJCAI.

[10]  Sophia Ananiadou,et al.  Developing a Robust Part-of-Speech Tagger for Biomedical Text , 2005, Panhellenic Conference on Informatics.

[11]  Zhiyong Lu,et al.  TaggerOne: joint named entity recognition and normalization with semi-Markov Models , 2016, Bioinform..

[12]  Tapio Salakoski,et al.  Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling , 2018, Database J. Biol. Databases Curation.

[13]  Tapio Salakoski,et al.  Distributional Semantics Resources for Biomedical Text Processing , 2013 .

[14]  Jiawei Han,et al.  Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions , 2015, IEEE Transactions on Knowledge and Data Engineering.

[15]  Tapio Salakoski,et al.  TurkuNLP Entry for Interactive Bio-ID Assignment , 2017 .

[16]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[17]  Zhiyong Lu,et al.  Challenges in clinical natural language processing for automated disorder normalization , 2015, J. Biomed. Informatics.

[18]  Xiaohui Liang,et al.  CHEMDNER system with mixed conditional random fields and multi-scale word clustering , 2015, Journal of Cheminformatics.

[19]  Ani Nenkova,et al.  Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , 2016, NAACL 2016.

[20]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[21]  Zhiyong Lu,et al.  GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains , 2015, BioMed research international.

[22]  José Luis Ambite,et al.  A Neural Named Entity Recognition Approach to Biological Entity Identification , 2017 .

[23]  Bowen Zhou,et al.  A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[24]  Tianwei Yu,et al.  K-Profiles: A Nonlinear Clustering Method for Pattern Detection in High Dimensional Data , 2015, BioMed research international.

[25]  Steven Bethard,et al.  A Survey on Recent Advances in Named Entity Recognition from Deep Learning models , 2018, COLING.

[26]  Erik M. van Mulligen,et al.  Recognition of chemical entities: combining dictionary-based and grammar-based approaches , 2015, Journal of Cheminformatics.

[27]  Eduard H. Hovy,et al.  End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF , 2016, ACL.

[28]  Roland Vollgraf,et al.  Contextual String Embeddings for Sequence Labeling , 2018, COLING.

[29]  Alex E. Lash,et al.  Gene Expression Omnibus: NCBI gene expression and hybridization array data repository , 2002, Nucleic Acids Res..

[30]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[31]  Thomas Hofmann,et al.  Deep Joint Entity Disambiguation with Local Neural Attention , 2017, EMNLP.

[32]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[33]  Omer Levy,et al.  Named Entity Disambiguation for Noisy Text , 2017, CoNLL.

[34]  Ioannis Xenarios,et al.  SourceData: a semantic platform for curating and searching figures , 2016, Nature Methods.

[35]  Damian Szklarczyk,et al.  The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible , 2016, Nucleic Acids Res..

[36]  José Luís Oliveira,et al.  Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools , 2012 .

[37]  Yaoyun Zhang,et al.  UTH_CCB: A report for SemEval 2014 – Task 7 Analysis of Clinical Text , 2014, *SEMEVAL.