CheNER: a tool for the identification of chemical entities and their classes in biomedical literature

Background: Small chemical molecules regulate biological processes at the molecular level. Those molecules are often involved in causing or treating pathological states. Automatically identifying such molecules in biomedical text is difficult because of both the diverse morphology of chemical names and the alternative types of nomenclature that are used simultaneously to describe them. To address these issues, the last BioCreAtIvE challenge proposed the CHEMDNER task, a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text.

Methods: To address this challenge, we tested various approaches to recognizing chemical entities in biomedical documents. These approaches range from linear Conditional Random Fields (CRFs) to a combination of CRFs with regular expression and dictionary matching, followed by a post-processing step, to tag chemical names in a corpus of Medline abstracts. We named our best-performing systems CheNER.

Results: We evaluated the performance of the various approaches using the F-score statistic. Higher F-scores indicate better performance. The highest F-score we obtained in identifying unique chemical entities is 72.88%. The highest F-score we obtained in identifying all chemical entities is 73.07%. We also evaluated the F-score obtained by combining our system with ChemSpot and found an increase from 72.88% to 73.83%.

Conclusions: CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents. In addition, CheNER may be used to derive new features to train newer methods for tagging chemical entities. CheNER can be downloaded from http://metres.udl.cat and included in text annotation pipelines.
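As an illustration of the F-score used in the Results, the following minimal Python sketch computes precision, recall and their harmonic mean (F1) from a set of predicted annotations and a set of gold-standard annotations. The function name `f_score`, the tuple representation of an annotation and the example data are assumptions made for this sketch; the abstract does not state the exact matching criteria of the CHEMDNER evaluation, so exact matching of whole annotations is assumed here.

```python
def f_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of annotations.

    An annotation counts as correct only if it matches a gold annotation
    exactly (an assumption of this sketch).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start offset, end offset, text) tuples.
gold = {(0, 7, "aspirin"), (12, 19, "ethanol"), (25, 32, "glucose")}
predicted = {
    (0, 7, "aspirin"),
    (12, 19, "ethanol"),
    (40, 44, "NaCl"),
    (50, 54, "iron"),
}

precision, recall, f1 = f_score(predicted, gold)
print(f"precision={precision:.2%} recall={recall:.2%} F1={f1:.2%}")
```

In this toy example two of the four predictions are correct and two of the three gold annotations are found, so the F-score lies between the precision and the recall. The abstract reports its F-scores as percentages in the same way.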
