CHEMDNER system with mixed conditional random fields and multi-scale word clustering

Background: Chemical compound and drug name recognition plays an important role in chemical text mining and is the basis for automatic relation extraction and event identification in chemical information processing. A high-performance named entity recognition system for chemical compound and drug names is therefore necessary.

Methods: We developed a CHEMDNER system based on mixed conditional random fields (CRF) with word clustering for chemical compound and drug name recognition. For the word clustering, we used Brown's hierarchical algorithm and the Skip-gram model based on deep learning, both trained on a large collection of PubMed articles (titles and abstracts).

Results: The system achieved the highest F-score, 88.20%, for the CDI task and the second highest F-score, 87.11%, for the CEM task in BioCreative IV. Multi-scale clustering based on deep learning further improved performance, yielding F-scores of 88.71% for CDI and 88.06% for CEM.

Conclusions: The mixed CRF model represents both the internal complexity and the external contexts of the entities, and it is integrated with word clustering to capture domain knowledge from PubMed titles and abstracts. This domain knowledge helps sustain entity recognition performance even without fine-grained linguistic features or manually designed rules.
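To make the idea of word-cluster features at several granularities concrete, the following is a minimal sketch of one common way such features could be fed to a CRF tagger. It is not the system described above: the lookup tables, prefix lengths, cluster counts, feature names, and the functions `cluster_features` and `sentence_features` are all hypothetical, and the assumption that the multi-scale Skip-gram clusters come from clustering word vectors at different cluster counts is ours, not stated in the abstract.

```python
from typing import Dict, List

# Hypothetical toy Brown-cluster paths (bit strings). Real paths would come
# from running Brown clustering over PubMed titles and abstracts.
BROWN_PATHS: Dict[str, str] = {
    "aspirin": "110100101",
    "ibuprofen": "110100110",
    "inhibits": "0111010",
}

# Hypothetical cluster IDs, assumed to be obtained by clustering Skip-gram
# word vectors at two granularities (100 and 1000 clusters).
EMBEDDING_CLUSTERS: Dict[int, Dict[str, int]] = {
    100: {"aspirin": 17, "ibuprofen": 17, "inhibits": 3},
    1000: {"aspirin": 412, "ibuprofen": 409, "inhibits": 55},
}

# Assumed prefix lengths; shorter prefixes correspond to coarser clusters.
PREFIX_LENGTHS = (4, 6, 8)


def cluster_features(token: str) -> List[str]:
    """Return cluster features for one token at several scales."""
    key = token.lower()
    feats: List[str] = []

    # Brown clusters: each prefix of the bit string is a coarser cluster.
    path = BROWN_PATHS.get(key)
    if path is not None:
        for n in PREFIX_LENGTHS:
            feats.append(f"brown{n}={path[:n]}")

    # Embedding clusters: one feature per granularity, coarsest first.
    for k, table in sorted(EMBEDDING_CLUSTERS.items()):
        if key in table:
            feats.append(f"sg{k}={table[key]}")
    return feats


def sentence_features(tokens: List[str]) -> List[List[str]]:
    """Build per-token feature lists in the shape most CRF toolkits accept."""
    return [[f"w={t.lower()}"] + cluster_features(t) for t in tokens]


if __name__ == "__main__":
    for token_feats in sentence_features(["Aspirin", "inhibits", "COX-1"]):
        print(token_feats)
```

In this toy data, "aspirin" and "ibuprofen" share their coarse features (`brown4`, `brown6`, `sg100`) but differ at the finer scales, which is the kind of generalization cluster features are meant to give a tagger for rare or unseen chemical names. A token missing from the tables, such as "COX-1" here, simply receives no cluster features.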
