A dictionary to identify small molecules and drugs in free text

MOTIVATION: The scientific community has devoted much effort to the correct identification of gene and protein names in text, and far less to the correct identification of chemical names. Dictionary-based term identification has the power to recognize the diverse representations of chemical information in the literature and to map the chemicals to their database identifiers.

RESULTS: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, manual checking of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40; recall 0.80 for trivial names); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary.

AVAILABILITY: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist.
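To make the dictionary approach and its evaluation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy dictionary with placeholder identifiers, a hypothetical exclusion set standing in for the term-filtering step, and exact span matching against a gold annotation; the names `DICTIONARY`, `AMBIGUOUS_TERMS`, `find_chemicals` and `precision_recall` are all illustrative.

```python
import re

# Hypothetical toy dictionary: term -> identifier (identifiers are placeholders,
# not taken from any of the resources named in the abstract).
DICTIONARY = {
    "aspirin": "CHEM:0001",
    "acetylsalicylic acid": "CHEM:0001",  # synonym mapped to the same identifier
    "caffeine": "CHEM:0002",
    "lead": "CHEM:0003",                  # also a common English verb
}

# Hypothetical stand-in for term filtering: terms considered too ambiguous to keep.
AMBIGUOUS_TERMS = frozenset({"lead"})


def find_chemicals(text, dictionary, excluded=frozenset()):
    """Return a set of (start, end, identifier) for case-insensitive whole-word matches."""
    # Longer terms come first so that multi-word names win over their substrings.
    terms = sorted((t for t in dictionary if t not in excluded), key=len, reverse=True)
    if not terms:
        return set()
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return {
        (m.start(), m.end(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    }


def precision_recall(predicted, gold):
    """Compute precision and recall over exactly matching annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy sentence with a hand-made gold annotation ("lead" is a verb here).
TEXT = "Aspirin and caffeine lead to fewer headaches."
GOLD = {(0, 7, "CHEM:0001"), (12, 20, "CHEM:0002")}

unfiltered = precision_recall(find_chemicals(TEXT, DICTIONARY), GOLD)
filtered = precision_recall(find_chemicals(TEXT, DICTIONARY, AMBIGUOUS_TERMS), GOLD)
```

In this toy case, removing the ambiguous term raises precision without changing recall. On a real corpus, filtering can also discard correct matches, which is the precision and recall trade-off the abstract reports.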
