D3NER: biomedical named entity recognition using CRF‐biLSTM improved with fine‐tuned embeddings of various linguistic information

Motivation: Recognition of biomedical named entities in the textual literature is a highly challenging research topic of great interest, as it is the prerequisite for extracting the vast amount of high-value biomedical knowledge deposited in unstructured text and transforming it into well-structured formats. Long short-term memory (LSTM) networks have recently been employed in various biomedical named entity recognition (NER) models with great success. However, they often do not take advantage of all useful linguistic information and still leave considerable room for improvement.

Results: We propose D3NER, a novel biomedical NER model using conditional random fields and bidirectional LSTMs improved with fine-tuned embeddings of various kinds of linguistic information. D3NER is thoroughly compared with seven recent state-of-the-art NER models, two of which are joint models with named entity normalization (NEN), a technique shown to improve NER performance. Experimental results on benchmark datasets, i.e. the BioCreative V Chemical Disease Relation (BC5 CDR) corpus, the NCBI Disease corpus and the FSU-PRGE gene/protein corpus, demonstrate that D3NER outperforms, and is more stable than, all compared models for chemical and gene/protein NER, and all models without joint NEN (like D3NER itself) for disease NER, in almost all cases. On the NCBI Disease corpus, its F1 for disease NER is 84.41%; its F1 for gene/protein NER on FSU-PRGE is 87.62%.

Availability and implementation: Data and source code are available at: https://github.com/aidantee/D3NER.

Supplementary information: Supplementary data are available at Bioinformatics online.
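
To illustrate the kind of architecture the abstract describes, the following is a minimal BiLSTM-CRF tagger sketch in PyTorch. It is not the D3NER implementation: D3NER additionally fine-tunes embeddings of several kinds of linguistic information, whereas this sketch uses only word embeddings, and the class name, hyperparameters and the third-party pytorch-crf package (torchcrf) are assumptions made for illustration. During training one would minimise loss() over mini-batches of BIO-tagged sentences and call predict() for Viterbi decoding at test time.

# Minimal BiLSTM-CRF tagger sketch (illustrative only, not the authors' code).
# Assumes the third-party pytorch-crf package: pip install pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=200, hidden_dim=256):
        super().__init__()
        # Word embeddings; in the paper's setting these would be initialised
        # from pre-trained biomedical vectors and fine-tuned during training.
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Bidirectional LSTM over the token sequence.
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        # Per-token emission scores for each BIO tag.
        self.emit = nn.Linear(hidden_dim, num_tags)
        # Linear-chain CRF modelling tag-to-tag transitions.
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))
        return self.emit(out)

    def loss(self, token_ids, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(token_ids), tags, mask=mask)

    def predict(self, token_ids, mask):
        # Viterbi decoding of the most likely tag sequence per sentence.
        return self.crf.decode(self._emissions(token_ids), mask=mask)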
