A New Workflow for QSAR Model Development from Small Data Sets: Integration of Data Curation, Exhaustive Double Cross-Validation, and a Set of Optimal Model Selection Techniques

Quantitative structure-activity relationship (QSAR) modeling is a well-established in silico technique with applications in drug design, predictive toxicology, materials science, and food science, among other fields. Experimental data for specialized endpoints are often scarce, so QSAR researchers frequently have to build models from small data sets. In the present study, we propose an integrated workflow for small data set modeling that combines data set curation, "exhaustive" double cross-validation, and a set of optimal model selection techniques, including consensus predictions. We have developed two software tools, Small Dataset Curator version 1.0.0 and Small Dataset Modeler version 1.0.0, that make the proposed workflow easy to execute. These tools are freely available for download from https://dtclab.webs.com/software-tools. We performed case studies on seven diverse data sets to demonstrate the performance of the proposed scheme (including data curation) for small data set QSAR modeling. The case studies also confirm the usability and stability of the developed software tools.
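To make the idea of "exhaustive" double cross-validation concrete, the following is a minimal Python sketch and not the implementation used in Small Dataset Modeler. It assumes a leave-one-out outer loop, one-descriptor least-squares models as the candidate set, and an inner loop that enumerates every calibration/validation split of the training compounds. The function names (`fit_line`, `inner_error`, `double_cv`), the parameter `n_val`, and the toy data are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for a single descriptor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # constant descriptor: fall back to the mean response
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def inner_error(X, y, train_idx, j, n_val):
    """Mean squared validation error of descriptor j over every possible
    calibration/validation split of the training compounds (inner loop)."""
    errs = []
    for val in combinations(train_idx, n_val):
        cal = [i for i in train_idx if i not in val]
        a, b = fit_line([X[i][j] for i in cal], [y[i] for i in cal])
        errs.extend((y[i] - (a + b * X[i][j])) ** 2 for i in val)
    return mean(errs)


def double_cv(X, y, n_val=2):
    """Outer leave-one-out loop; the descriptor is selected using the
    training compounds only, so the held-out compound never influences
    model selection."""
    n, p = len(y), len(X[0])
    preds, chosen = [], []
    for test in range(n):
        train_idx = [i for i in range(n) if i != test]
        best = min(range(p),
                   key=lambda j: inner_error(X, y, train_idx, j, n_val))
        a, b = fit_line([X[i][best] for i in train_idx],
                        [y[i] for i in train_idx])
        preds.append(a + b * X[test][best])
        chosen.append(best)
    return preds, chosen


# Toy data (invented): descriptor 0 determines y exactly, descriptor 1 is noise.
X = [[1.0, 0.5], [2.0, -1.0], [3.0, 2.0], [4.0, 0.3], [5.0, -0.7], [6.0, 1.1]]
y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
preds, chosen = double_cv(X, y)
```

Enumerating all splits removes the dependence on one random partition, which matters most when only a handful of compounds are available. The cost grows combinatorially with the number of compounds, so this brute-force form is practical only for small data sets.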
