The CHEMDNER corpus of chemicals and drugs and its annotation principles

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. Developing supervised named entity recognition (NER) systems calls for a large, manually annotated text corpus. Large corpora also permit the robust evaluation and comparison of different approaches to detecting chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each chemical entity mention was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic, or trivial. The difficulty and consistency of tagging chemicals in text were measured in an inter-annotator agreement study, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the gold standard manual annotations, but also the mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for the minimum information required about entity annotations when constructing domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/

Zhiyong Lu | Karin M. Verspoor | Hongfang Liu | Shuo Xu | K. E. Ravikumar | Keun Ho Ryu | Alfonso Valencia | Hua Xu | Paloma Martínez | Daniel M. Lowe | Roger A. Sayle | Richard Tzong-Han Tsai | Sérgio Matos | Isabel Segura-Bedmar | Buzhou Tang | Jan A. Kors | Madian Khabsa | C. Lee Giles | Masaharu Yoshioka | Marko Bajec | Hong-Jie Dai | Tolga Can | Tsendsuren Munkhdalai | Saber A. Akhondi | Francisco M. Couto | Tim Rocktäschel | Riza Theresa Batista-Navarro | David Salgado | Martin Krallinger | Matthias Irmer | Utpal Kumar Sikdar | Thaer M. Dieb | Florian Leitner | Andre Lamurias | Anabel Usie | Miguel Vazquez | Slavko Zitnik | Obdulia Rabal | Julen Oyarzabal | Lutz Weber | Dong-Hong Ji | Rafal Rak | Yanan Lu | Robert Leaman | Asif Ekbal | David Campos | S. V. Ramanan | P. Senthil Nathan | Xin An | Rui Alves | Torsten Huber | Miji Choi | Caglar Ata
