Regression applied to protein binding site prediction and comparison with classification

Background: Structural genomics centers provide hundreds of protein structures of unknown function, so methods that determine protein function automatically are needed. A protein's function can be determined by studying the network of its physical interactions. In this context, identifying a potential binding site between proteins is of primary interest. In the literature, methods for predicting the location of a potential binding site are generally based on classification tools. The aim of this paper is to show that regression tools are more effective than classification tools for patch-based binding site predictors. For this purpose, we developed a patch-based binding site localization method that can be used with either regression or classification tools.

Results: We compared the predictive performance of regression tools with that of machine learning classifiers. Using leave-one-out cross-validation, we showed that regression tools provide better predictions than classification tools. Among the regression tools, the Multilayer Perceptron ranked highest in prediction quality. We also compared the predictive performance of our patch-based method using the Multilayer Perceptron with the performance of three other methods available through a web server. Our method performed similarly to the other methods.

Conclusion: Regression is more effective than classification when applied to our binding site localization method. Where it is possible, using regression instead of classification in other existing binding site predictors will probably improve their results. Furthermore, the method presented in this work is flexible because the size of the predicted binding site is adjustable. This adaptability is useful when either the false positive rate or the false negative rate has to be limited.
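The contrast between a regression target and a class label, and the effect of an adjustable predicted-site size, can be illustrated with a small sketch. The abstract does not specify how patches are scored, so everything here is an assumption made for illustration: the regression target is taken to be the fraction of a patch's residues that lie in the true interface, the class label is that fraction thresholded at a hypothetical cutoff of 0.5, and the predicted site is the union of the `top_k` best-scoring patches. The patch names, residue numbers, and function names are invented toy values, not the authors' data or implementation.

```python
from typing import Dict, Set, Tuple

# Toy data (hypothetical): three surface patches and a known interface,
# each given as a set of residue indices.
PATCHES: Dict[str, Set[int]] = {
    "p1": {1, 2, 3},
    "p2": {3, 4, 5},
    "p3": {8, 9, 10},
}
INTERFACE: Set[int] = {1, 2, 3, 4}


def overlap_fraction(patch: Set[int], interface: Set[int]) -> float:
    """Assumed regression target: share of patch residues in the interface."""
    return len(patch & interface) / len(patch)


def class_label(patch: Set[int], interface: Set[int], cutoff: float = 0.5) -> int:
    """Assumed classification target: 1 if the overlap reaches the cutoff."""
    return int(overlap_fraction(patch, interface) >= cutoff)


def predict_site(patches: Dict[str, Set[int]],
                 scores: Dict[str, float],
                 top_k: int) -> Set[int]:
    """Union of the residues of the top_k highest-scoring patches."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    site: Set[int] = set()
    for name in ranked[:top_k]:
        site |= patches[name]
    return site


def precision_recall(predicted: Set[int], true: Set[int]) -> Tuple[float, float]:
    """Residue-level precision and recall of a predicted site."""
    tp = len(predicted & true)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall


if __name__ == "__main__":
    # The continuous target separates p1 from p2; the binary label does not.
    for name, patch in PATCHES.items():
        print(name, overlap_fraction(patch, INTERFACE), class_label(patch, INTERFACE))

    # Here the true overlap stands in for a model's predicted score.
    scores = {n: overlap_fraction(p, INTERFACE) for n, p in PATCHES.items()}
    for k in (1, 2, 3):
        site = predict_site(PATCHES, scores, top_k=k)
        print(k, sorted(site), precision_recall(site, INTERFACE))
```

In this toy case, patches `p1` and `p2` receive the same class label although their overlap with the interface differs, which is the kind of ranking information a continuous target preserves. Increasing `top_k` enlarges the predicted site, raising recall at the cost of precision, which mirrors the trade-off between false negatives and false positives mentioned in the conclusion.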
