Challenges in mining the literature for chemical information

Chemical information extracted from the literature is of immense value to the pharmaceutical and chemical industries in many areas, including drug discovery, manufacturing processes, and intellectual property protection. However, the exponential growth of the chemical literature has made it increasingly difficult for researchers to find the information they need within a reasonable time frame. To address this issue, a large number of text mining approaches have been developed that can extract chemical information from different types of literature. Yet the lack of a single universal standard for representing chemical structures and nomenclature poses significant challenges for mining this information. This review therefore presents the current state of chemical text mining, the problems encountered, the solutions available, and future prospects.
