Proteochemometric modeling in a Bayesian framework

Proteochemometrics (PCM) is an approach to predictive bioactivity modeling that relates protein and chemical information. Gaussian Processes (GP), based on Bayesian inference, provide the most objective estimation of the uncertainty of the predictions, thus permitting the evaluation of the applicability domain (AD) of the model. Furthermore, the experimental error on bioactivity measurements can be used as input for this probabilistic model.

In this study, we apply GP implemented with a panel of kernels to three diverse, multispecies PCM datasets. The first dataset consists of 8 human and rat adenosine receptors with 10,999 small molecule ligands and their binding affinities. The second consists of the catalytic activity of four dengue virus NS3 proteases on 56 small peptides. Finally, we gathered bioactivity information for small molecule ligands on 91 aminergic GPCRs from 9 different species, leading to a dataset of 24,593 datapoints with a matrix completeness of only 2.43%.

GP models trained on these datasets are statistically sound and perform on par with Support Vector Machines (SVM), with R₀² values on the external dataset ranging from 0.68 to 0.92 and RMSEP values close to the experimental error. Furthermore, the best GP models, obtained with the normalized polynomial and radial kernels, provide confidence intervals for the predictions that agree with the cumulative Gaussian distribution. GP models were also interpreted on the basis of individual targets and of ligand descriptors. In the dengue dataset, interpreting the model in terms of the amino-acid positions in the tetra-peptide ligands gave biologically meaningful results.
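To make the role of the predictive variance and of the experimental error concrete, the following is a minimal sketch of GP regression with a radial (RBF) kernel, written with NumPy only. It is not the implementation used in the study: the function names (`rbf_kernel`, `gp_predict`, `interval_coverage`), the unit prior variance, the fixed `length_scale`, and the toy data are all assumptions made for illustration. The per-datapoint experimental error enters as a noise variance added to the diagonal of the kernel matrix, and the coverage helper shows one way the agreement between confidence intervals and the Gaussian distribution could be checked.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Radial (RBF) kernel with unit prior variance (an assumption)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_predict(X_train, y_train, X_test, noise_var, length_scale=1.0):
    """Posterior mean and variance of the latent function at X_test.

    noise_var holds one variance per training datapoint, e.g. the squared
    experimental error of each bioactivity measurement.
    """
    K = rbf_kernel(X_train, X_train, length_scale) + np.diag(noise_var)
    K_s = rbf_kernel(X_train, X_test, length_scale)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)                # k(x*, x*) = 1 for this kernel
    return mean, np.maximum(var, 0.0)


def interval_coverage(y_true, mean, var_total, z=1.96):
    """Fraction of observations falling inside mean +/- z * sd."""
    inside = np.abs(y_true - mean) <= z * np.sqrt(var_total)
    return float(np.mean(inside))


# Toy data (hypothetical): 20 noisy observations of a smooth function.
X_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(X_train[:, 0])
noise_var = np.full(20, 0.01)

# One query inside the training range, one far outside it.
X_test = np.array([X_train[10], [100.0]])
mean, var = gp_predict(X_train, y_train, X_test, noise_var)
print("predictive mean:", mean)
print("predictive variance:", var)
```

In this sketch the variance is small for the query that coincides with a training point and returns to the prior variance for the distant query, which is the behaviour that makes the predictive variance usable as an applicability-domain indicator. For a calibrated model, `interval_coverage` evaluated on held-out data (with the measurement noise added to the latent variance) would be expected to be close to 0.95 at `z=1.96`.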
