Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets

Background: While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants). In addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks, which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants.

Results: The amino acid descriptor sets compared here show similar performance (<0.1 log units RMSE difference and <0.1 difference in MCC), whereas the differences in error between individual proteins were in some cases larger than those arising from the choice of descriptor set (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than using individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), and Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last.

Conclusions: Although amino acid descriptor sets capture different aspects of amino acids, their usefulness for bioactivity modeling is, on average, surprisingly similar. Still, combining sets that describe complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, from both the ligand side and the protein side.
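The two performance measures quoted above, RMSE for regression and MCC for classification, can be illustrated with a minimal sketch. The function names `rmse` and `mcc` and the convention of returning 0.0 when the MCC denominator vanishes are assumptions made for this illustration; the abstract does not state how the authors computed the metrics, and this is not their code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values
    (for bioactivity data, typically expressed in log units)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Under these definitions, a difference of 0.01 log units in RMSE or 0.01 in MCC, as reported for the combined descriptor sets, is small relative to the full range of either metric (MCC spans -1 to 1).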
