Improving the accuracy of high-throughput protein-protein affinity prediction may require better training data

Background: One goal of structural biology is to understand how a protein’s three-dimensional conformation determines its capacity to interact with potential ligands. In the case of small chemical ligands, deconstructing a static protein-ligand complex into its constituent atom-atom interactions is typically sufficient to rapidly predict ligand affinity with high accuracy (>70% correlation between predicted and experimentally determined affinity), a fact that is exploited to support structure-based drug design. We recently found that protein-DNA/RNA affinity can also be predicted with high accuracy using extensions of existing techniques, but protein-protein affinity could not be predicted with >60% correlation, even when the protein-protein complex was available.

Methods: X-ray and NMR structures of protein-protein complexes, their associated binding affinities and experimental conditions were obtained from different binding affinity and structural databases. Statistical models were implemented using a generalized linear model framework, including the experimental conditions as new model features. We evaluated the potential for new features to improve affinity prediction models by calculating the Pearson correlation between predicted and experimental binding affinities on the training and test data after model fitting and after cross-validation. Differences in accuracy were assessed using a two-sample t test and the nonparametric Mann–Whitney U test.

Results: Here we evaluate a range of potential factors that may interfere with accurate protein-protein affinity prediction. We find that X-ray crystal resolution has the strongest single effect on protein-protein affinity prediction. Limiting our analyses to only high-resolution complexes (≤2.5 Å) increased the correlation between predicted and experimental affinity from 54% to 68% (p = 4.32 × 10−3). In addition, incorporating information on the experimental conditions under which affinities were measured (pH, temperature and binding assay) had significant effects on prediction accuracy. We also highlight a number of potential errors in large structure-affinity databases, which could affect both model training and accuracy assessment.

Conclusions: The results suggest that the accuracy of statistical models for protein-protein affinity prediction may be limited by the information present in databases used to train new models. Improving our capacity to integrate large-scale structural and functional information may be required to substantively advance our understanding of the general principles by which a protein’s structure determines its function.
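
The accuracy measure used throughout (the Pearson correlation between predicted and experimental affinities, computed on all complexes and again on the ≤2.5 Å subset) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Complex` record, its field names, the helper names and the toy values below are all hypothetical and serve only to show the shape of the comparison.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Complex(NamedTuple):
    """Hypothetical record for one protein-protein complex."""
    resolution: float    # X-ray resolution in angstroms
    experimental: float  # measured affinity (e.g. pKd); illustrative units
    predicted: float     # affinity predicted by some fitted model


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_resolution(
    complexes: Sequence[Complex], cutoff: float = 2.5
) -> Tuple[float, float]:
    """Return (r on all complexes, r on complexes with resolution <= cutoff)."""
    def r(items: Sequence[Complex]) -> float:
        return pearson([c.predicted for c in items],
                       [c.experimental for c in items])

    high_res = [c for c in complexes if c.resolution <= cutoff]
    return r(complexes), r(high_res)


# Toy values invented for illustration only; they are not data from the study.
toy = [
    Complex(1.8, 6.0, 6.1),
    Complex(2.0, 7.5, 7.4),
    Complex(2.3, 9.0, 9.2),
    Complex(2.5, 8.0, 7.9),
    Complex(3.1, 6.5, 8.5),
    Complex(3.4, 9.5, 7.0),
]

r_all, r_high = correlation_by_resolution(toy)
print(f"all complexes: r = {r_all:.2f}; <= 2.5 A only: r = {r_high:.2f}")
```

In a real analysis the predicted values would come from a model evaluated under cross-validation, and the difference between the two correlations would be tested for significance (the abstract names a two-sample t test and the Mann–Whitney U test) rather than read off a single split.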
