Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning

Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, few patent mining systems are publicly available, probably because of the lack of large manually annotated corpora. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the CEMP and CPD subtasks using machine learning-based systems built on two algorithms: conditional random fields (CRFs) and structured support vector machines (SSVMs). To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features drawn from dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms that generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated from the patent titles and abstracts in which the CEMP system recognized chemicals. The effects of the proposed feature strategies on both machine learning-based systems were investigated.
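The dependence of CPD on CEMP output can be illustrated with a minimal sketch. It assumes, for illustration only, that a passage (title or abstract) is labelled as chemical when at least one mention is recognized in it, and that the passage score is the highest mention confidence. The `Mention` type, the `classify_passage` function and the scoring rule are hypothetical and are not taken from the described system.

```python
from typing import List, Tuple

# Hypothetical stand-in for CEMP output: (mention text, confidence) pairs.
Mention = Tuple[str, float]


def classify_passage(mentions: List[Mention]) -> Tuple[int, float]:
    """Return (label, score) for one title or abstract.

    Assumed rule: label 1 if any chemical mention was recognized, else 0.
    Assumed score: the highest mention confidence, or 0.0 with no mentions.
    """
    if not mentions:
        return 0, 0.0
    return 1, max(confidence for _, confidence in mentions)


if __name__ == "__main__":
    title_mentions = [("ibuprofen", 0.97), ("sodium chloride", 0.88)]
    print(classify_passage(title_mentions))  # passage with recognized chemicals
    print(classify_passage([]))              # passage with no recognized chemicals
```

Under this assumed rule, CPD quality is bounded by CEMP recall at the passage level: a passage is missed only if every chemical mention in it is missed.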
Our best system achieved the second-best performance among 21 participating teams in CEMP, with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94%, and was the top-performing system among nine participating teams in CPD, with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and the unsupervised learning algorithms significantly improved performance on the chemical NER task in patents. Database URL: http://database.oxfordjournals.org/content/2016/baw049
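As a quick consistency check on the reported CEMP scores, the F-measure can be recomputed from precision and recall, assuming the standard balanced F1 definition (harmonic mean). The sketch below also includes the usual confusion-matrix form of MCC; the function names are illustrative, and the confusion-matrix counts for CPD are not given in the text, so no MCC value is recomputed here.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Reported CEMP precision and recall, in percent.
    print(round(f_measure(87.18, 90.78), 2))  # prints 88.94
```

The recomputed value matches the reported F-measure of 88.94%.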
