A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications

The quality of the data used for QSAR model derivation is extremely important, because it strongly affects the robustness and predictive power of the final model. Ambiguous or incorrect structures need to be checked carefully, because they cause errors in the calculation of descriptors and hence lead to meaningless results. The increasing amount of data, however, often makes it impractical to check very large databases manually. In light of this, we designed and implemented a semi-automated workflow that integrates structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, and the selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. It has been implemented in the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.
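The automated comparison step can be illustrated with a minimal Python sketch. It is not the KNIME workflow itself: the use of InChIKeys as the compared identifier, the majority rule, the `min_agreement` threshold, and the source names (`source_a`, etc.) are all assumptions made for illustration. The idea is that a structure is accepted automatically only when enough independent sources agree, and is otherwise flagged for manual checking.

```python
from collections import Counter


def consensus_identifier(records, min_agreement=2):
    """Compare identifiers retrieved for one compound from several sources.

    records: dict mapping a (hypothetical) source name to the identifier
             string it returned, or None if the source had no entry.
    Returns a dict with the agreed identifier (or None), a status flag,
    and the sources involved.
    """
    # Keep only sources that actually returned a non-empty identifier.
    found = {
        src: ident.strip()
        for src, ident in records.items()
        if ident and ident.strip()
    }
    if not found:
        return {"identifier": None, "status": "missing", "sources": []}

    counts = Counter(found.values())
    top, n = counts.most_common(1)[0]
    # Identifiers sharing the highest count; more than one means a tie.
    tied = [ident for ident, c in counts.items() if c == n]

    if len(tied) > 1 or n < min_agreement:
        # Sources disagree or too few agree: leave the decision to a human.
        return {
            "identifier": None,
            "status": "manual_check",
            "sources": sorted(found),
        }

    agreeing = sorted(src for src, ident in found.items() if ident == top)
    return {"identifier": top, "status": "consensus", "sources": agreeing}


# Illustrative input: two sources agree, one differs, one has no entry.
example = {
    "source_a": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "source_b": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "source_c": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "source_d": None,
}
result = consensus_identifier(example)
print(result["status"], result["identifier"], result["sources"])
```

Returning a `manual_check` status rather than forcing a choice mirrors the semi-automated character of the approach: clear cases are resolved automatically, and ambiguous ones are left for expert review.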
