SureChEMBL: a large-scale, chemically annotated patent document database

SureChEMBL is a publicly available, large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature daily by an automated text- and image-mining pipeline. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented by combined structure- and keyword-based search capabilities against the compound repository and patent document corpus. Given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at https://www.surechembl.org/.
