Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research

Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, they are at the mercy of data providers, who may inadvertently publish (partially) erroneous data. Dataset curation is therefore crucial for any cheminformatics analysis, such as similarity searching, clustering, QSAR modeling, or virtual screening, especially now that the availability of chemical datasets in the public domain has skyrocketed. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database, including the removal of the fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. These steps include the removal of inorganic and organometallic compounds, counterions, salts, and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies in which chemical curation of the original “raw” database either enabled a successful modeling study (specifically, QSAR analysis) or significantly improved the model's prediction accuracy. We also demonstrate that, in some cases, rigorously developed QSAR models can even be used to correct erroneous biological data associated with chemical compounds. We believe that the good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies.
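
As a rough illustration of three of the steps listed above (stripping counterions, discarding inorganic and organometallic records, and handling duplicates), the following minimal sketch operates on plain SMILES strings using only the Python standard library. It is not the procedure used in the paper. Several things here are assumptions made for illustration: the function names, the short `METALS` list, the choice to keep the fragment with the most heavy atoms, the rule of averaging duplicate activities only when they agree within a `tolerance`, and the premise that the input SMILES were already canonicalized by a cheminformatics toolkit. Structure validation, aromatization, chemotype normalization, and tautomer curation are not attempted, since they require a real toolkit.

```python
import re
from collections import defaultdict

# Illustrative, non-exhaustive list of metals used to flag salts and organometallics.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Fe", "Co", "Ni",
          "Cu", "Zn", "Pd", "Pt", "Sn", "Hg"}

# Matches bracket atoms such as [Na+] or [nH], and organic-subset atoms such as C, Cl, c.
_ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(?P<organic>Cl|Br|[BCNOPSFI]|[bcnops])"
)


def elements(smiles):
    """Return the element symbols found in a SMILES string (aromatic atoms capitalized)."""
    return [(m.group("bracket") or m.group("organic")).capitalize()
            for m in _ATOM.finditer(smiles)]


def largest_fragment(smiles):
    """Keep the dot-separated fragment with the most heavy (non-hydrogen) atoms."""
    return max(smiles.split("."),
               key=lambda frag: sum(e != "H" for e in elements(frag)))


def is_modelable(smiles):
    """Reject inorganic records (no carbon) and records that contain a listed metal."""
    found = set(elements(smiles))
    return "C" in found and not (found & METALS)


def curate(records, tolerance=0.5):
    """Curate (smiles, activity) pairs into a {parent_smiles: activity} mapping.

    Duplicates are merged by averaging when their activities differ by at most
    `tolerance`; otherwise every copy of that structure is discarded as conflicting.
    """
    grouped = defaultdict(list)
    for smiles, activity in records:
        parent = largest_fragment(smiles)
        if is_modelable(parent):
            grouped[parent].append(activity)

    curated = {}
    for parent, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            curated[parent] = sum(values) / len(values)
    return curated


if __name__ == "__main__":
    raw = [
        ("CC(=O)[O-].[Na+]", 1.0),  # salt: the sodium counterion is stripped
        ("CC(=O)[O-]", 1.2),        # duplicate of the stripped parent above
        ("[Na+].[Cl-]", 0.3),       # inorganic: removed
        ("C[Hg]Cl", 2.0),           # organometallic: removed
        ("c1ccccc1O", 4.0),         # conflicting duplicates: both removed
        ("c1ccccc1O", 6.0),
        ("CCO.Cl", 3.0),            # hydrochloride-like record: HCl fragment stripped
    ]
    for smiles, activity in curate(raw).items():
        print(smiles, round(activity, 2))
```

Because the comparison is purely textual, two different SMILES strings for the same compound (for example, a charged and a neutral form, or two tautomers) are not recognized as duplicates here. This is exactly the kind of case that the normalization and tautomer-curation steps are meant to address.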
