Chemical databases: curation or integration by user-defined equivalence?

Publicly available databases hold a wealth of valuable chemical information for scientists undertaking drug discovery. However, finite curation resources, limitations of chemical structure software, and differences between individual database applications mean that exact chemical structure equivalence between databases is unlikely ever to be a reality. Identifying compound equivalence has been made significantly easier by the International Chemical Identifier (InChI), a non-proprietary line notation for describing a chemical structure. More importantly, advances in methods for identifying compounds that are equivalent at different levels of similarity, such as those containing the same parent component or having the same connectivity, now enable related compounds to be linked between databases even where the structure matches are not exact.
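As an illustration of matching at the connectivity level, the following minimal sketch compares two standard InChI strings on their formula, connection (/c) and hydrogen (/h) layers only, ignoring stereochemistry. It is not taken from any of the databases discussed; the function names `inchi_layers`, `connectivity_key` and `same_connectivity` are hypothetical, and the sketch assumes simple single-component standard InChIs.

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI string into its formula and prefixed layers."""
    prefix, _, body = inchi.partition("/")
    if not prefix.startswith("InChI=") or not body:
        raise ValueError(f"not an InChI string: {inchi!r}")
    parts = body.split("/")
    layers = {"formula": parts[0]}
    for part in parts[1:]:
        if part:
            # Layers are identified by their first letter (c, h, q, p, b, t, m, s, i).
            # Assumption: if a letter repeats (e.g. inside the isotopic layer),
            # only the first occurrence is kept.
            layers.setdefault(part[0], part[1:])
    return layers


def connectivity_key(inchi: str) -> tuple:
    """Return the formula, connection and hydrogen layers, dropping the rest."""
    layers = inchi_layers(inchi)
    return (layers["formula"], layers.get("c", ""), layers.get("h", ""))


def same_connectivity(a: str, b: str) -> bool:
    """True when two InChIs agree on formula, connections and hydrogens."""
    return connectivity_key(a) == connectivity_key(b)


# L-alanine and D-alanine differ only in their stereo layers (/t, /m, /s).
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

exact_match = l_alanine == d_alanine
linked_by_connectivity = same_connectivity(l_alanine, d_alanine)
```

Here the two enantiomers are not an exact match, yet they are linked once the stereo layers are ignored. Matching on the parent component (for example, after removing salt counterions) would additionally require handling multi-component InChIs, which this sketch does not attempt.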
