Chemically Aware Model Builder (camb): an R package for property and bioactivity modelling of small molecules

Abstract

Background: In silico predictive models have proved valuable for optimising compound potency, selectivity and safety profiles in the drug discovery process.

Results: camb is an R package that provides an environment for the rapid generation of quantitative structure-property and structure-activity models for small molecules (including QSAR, QSPR, QSAM and PCM), and is aimed at both beginner and advanced R users. camb's capabilities include standardisation of chemical structure representation; computation of 905 one-dimensional and 14 fingerprint-type descriptors for small molecules, 8 types of amino acid descriptors and 13 whole-protein sequence descriptors; filtering methods for feature selection; generation of predictive models (through an interface to the R package caret); and creation of model ensembles (using the R package caretEnsemble). Results can be visualised through high-quality, customisable plots (R package ggplot2).

Conclusions: Overall, camb constitutes an open-source framework to perform the following steps: (1) compound standardisation, (2) molecular and protein descriptor calculation, (3) descriptor pre-processing and model training, visualisation and validation, and (4) bioactivity/property prediction for new molecules. camb aims to speed up model generation while supporting reproducibility and tests of robustness. QSPR and proteochemometric case studies are included to demonstrate camb's application.

Graphical abstract: From compounds and data to models: a complete model building workflow in one package.
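
To make the descriptor-filtering step concrete, here is a minimal sketch of the kind of filter the abstract refers to: dropping descriptors with near-zero variance, then dropping descriptors that are highly correlated with one already kept. It is written in plain Python for illustration and is not camb's or caret's API. The function name `filter_descriptors`, the thresholds and the toy descriptor values are all assumptions, and the greedy "keep the first of a correlated pair" rule is a simplification of what dedicated packages do.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def filter_descriptors(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Return the names of descriptors that survive both filters.

    columns: dict mapping descriptor name -> list of values (one per compound).
    Thresholds are illustrative defaults, not values taken from camb.
    """
    kept = []
    for name, values in columns.items():
        # Filter 1: a (near-)constant descriptor carries no information.
        if pvariance(values) <= min_variance:
            continue
        # Filter 2: skip descriptors redundant with one already kept.
        # Greedy simplification: the first descriptor of a correlated pair wins.
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept


# Toy, made-up descriptor table: five compounds, four descriptors.
descriptors = {
    "d1": [1, 2, 3, 4, 5],
    "d2": [2, 4, 6, 8, 10.5],   # almost a multiple of d1
    "d3": [1, 1, 1, 1, 1],      # constant
    "d4": [5, 1, 4, 2, 3],      # weakly correlated with d1
}
print(filter_descriptors(descriptors))
```

In a real workflow the surviving descriptors would then be centred and scaled and passed to model training; those steps are left out to keep the sketch short.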
