PubChem chemical structure standardization

Background: PubChem is a chemical information repository consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored in Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes their success rates, the reasons structures are rejected, and the modifications applied to structures during standardization. Furthermore, PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.

Results: The observed rejection rate for substances processed by PubChem standardization was 0.36%, predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in Substance to 45,808,881 in Compound, as identified by de-aromatized canonical isomeric SMILES. Even though the average processing time is very low (only 0.4% of structures have an individual standardization time above 0.1 s), total standardization time is dominated by edge cases: 90% of the time needed to standardize all structures in PubChem Substance is spent on the 2.05% of structures with the highest individual standardization times. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI, primarily due to preferences for a different tautomeric form.

Conclusions: Standardization of chemical structures is complicated by the diversity of chemical information and of the approaches used to represent it. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid or incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to validate thoroughly, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize) and via programmatic interfaces.
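The two headline statistics in the Results can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the PubChem pipeline: the function names are hypothetical, and the timing list is synthetic data invented only to show how a "share of total time spent on the slowest fraction of structures" figure can be computed. The unique-structure counts are the ones reported in the abstract.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


def slowest_fraction_for_share(times, share):
    """Smallest fraction of items, taken slowest first, whose summed
    time reaches `share` (between 0 and 1) of the total time."""
    ordered = sorted(times, reverse=True)
    target = share * sum(ordered)
    running = 0.0
    for count, t in enumerate(ordered, start=1):
        running += t
        if running >= target:
            return count / len(ordered)
    return 1.0


# Unique-structure counts reported in the abstract (Substance vs. Compound).
reduction = percent_reduction(53_574_724, 45_808_881)
print(f"{reduction:.1f}% fewer unique structures after standardization")

# Synthetic per-structure timings in seconds (illustrative only, not measured):
# 98 fast structures and 2 slow edge cases.
toy_times = [0.01] * 98 + [5.0, 5.0]
print(slowest_fraction_for_share(toy_times, 0.90))
```

On the reported counts, the reduction works out to roughly 14.5%. In the toy timing data, the two slow structures (2% of the items) already account for more than 90% of the total time, which mirrors the kind of skew the abstract describes.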
