Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites

MOTIVATION The limited availability of protein structures often restricts the functional annotation of proteins and the identification of their protein-protein interaction sites. Computational methods to identify interaction sites from protein sequences alone are, therefore, required for unraveling the functions of many proteins. This article describes a new method (PSIVER) to predict interaction sites, i.e. residues binding to other proteins, in protein sequences. Only sequence features (position-specific scoring matrix and predicted accessibility) are used for training a Naïve Bayes classifier (NBC), and conditional probabilities of each sequence feature are estimated using a kernel density estimation method (KDE). RESULTS The leave-one out cross-validation of PSIVER achieved a Matthews correlation coefficient (MCC) of 0.151, an F-measure of 35.3%, a precision of 30.6% and a recall of 41.6% on a non-redundant set of 186 protein sequences extracted from 105 heterodimers in the Protein Data Bank (consisting of 36 219 residues, of which 15.2% were known interface residues). Even though the dataset used for training was highly imbalanced, a randomization test demonstrated that the proposed method managed to avoid overfitting. PSIVER was also tested on 72 sequences not used in training (consisting of 18 140 residues, of which 10.6% were known interface residues), and achieved an MCC of 0.135, an F-measure of 31.5%, a precision of 25.0% and a recall of 46.5%, outperforming other publicly available servers tested on the same dataset. PSIVER enables experimental biologists to identify potential interface residues in unknown proteins from sequence information alone, and to mutate those residues selectively in order to unravel protein functions. AVAILABILITY Freely available on the web at http://tardis.nibio.go.jp/PSIVER/

[1]  E. Parzen On Estimation of a Probability Density Function and Mode , 1962 .

[2]  B. Lee,et al.  The interpretation of protein structures: estimation of static accessibility. , 1971, Journal of molecular biology.

[3]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[4]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[5]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[6]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[7]  S. Jones,et al.  Analysis of protein-protein interaction sites using surface patches. , 1997, Journal of molecular biology.

[8]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[9]  S. Jones,et al.  Prediction of protein-protein interaction sites using patch analysis. , 1997, Journal of molecular biology.

[10]  Chris Sander,et al.  The HSSP database of protein structure-sequence alignments and family profiles , 1998, Nucleic Acids Res..

[11]  A. Thomas,et al.  A fast method to predict protein interaction sites from sequences. , 2000, Journal of molecular biology.

[12]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[13]  Pierre Baldi,et al.  Assessing the accuracy of prediction algorithms for classification: an overview , 2000, Bioinform..

[14]  Huan‐Xiang Zhou,et al.  Prediction of protein interaction sites from sequence profile and residue neighbor list , 2001, Proteins.

[15]  A. Valencia,et al.  Prediction of protein--protein interaction sites in heterocomplexes with neural networks. , 2002, European journal of biochemistry.

[16]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[17]  B. Rost,et al.  Predicted protein–protein interaction sites from local sequence information , 2003, FEBS letters.

[18]  Zhiping Weng,et al.  A protein–protein docking benchmark , 2003, Proteins.

[19]  J. Thornton,et al.  Structural characterisation and functional significance of transient protein-protein interactions. , 2003, Journal of molecular biology.

[20]  Steven Salzberg,et al.  On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach , 1997, Data Mining and Knowledge Discovery.

[21]  Vasant Honavar,et al.  A two-stage classifier for identification of protein-protein interface residues , 2004, ISMB/ECCB.

[22]  Zsuzsanna Dosztányi,et al.  Transmembrane proteins in the Protein Data Bank: identification and classification , 2004, Bioinform..

[23]  R. Raz,et al.  ProMate: a structure based prediction program to identify the location of protein-protein binding sites. , 2004, Journal of molecular biology.

[24]  Olivier Lichtarge,et al.  BIOINFORMATICS ORIGINAL PAPER Systems biology , 2004 .

[25]  Aleksey A. Porollo,et al.  Linear Regression Models for Solvent Accessibility Prediction in Proteins , 2005, J. Comput. Biol..

[26]  Vasant Honavar,et al.  Predicting DNA-binding sites of proteins from amino acid sequence , 2006, BMC Bioinformatics.

[27]  Aleksey A. Porollo,et al.  Combining prediction of secondary structure and solvent accessibility in proteins , 2005, Proteins.

[28]  R. Abagyan,et al.  Optimal docking area: A new method for predicting protein–protein interaction sites , 2004, Proteins.

[29]  Z. Weng,et al.  Protein–protein docking benchmark 2.0: An update , 2005, Proteins.

[30]  George Hripcsak,et al.  Technical Brief: Agreement, the F-Measure, and Reliability in Information Retrieval , 2005, J. Am. Medical Informatics Assoc..

[31]  Ziv Bar-Joseph,et al.  Evaluation of different biological data and computational classification methods for use in protein interaction prediction , 2006, Proteins.

[32]  Richard M. Jackson,et al.  Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces , 2006, Bioinform..

[33]  Peng Chen,et al.  Predicting protein interaction sites from residue spatial sequence profile and evolution rate , 2006, FEBS Letters.

[34]  Burkhard Rost,et al.  ISIS: interaction sites identified from sequence , 2007, Bioinform..

[35]  Interaction-site prediction for protein complexes: a critical assessment , 2007, Bioinform..

[36]  Aleksey A. Porollo,et al.  Prediction‐based fingerprints of protein–protein interactions , 2006, Proteins.

[37]  Jae-Hyung Lee,et al.  RNABindR: a server for analyzing and predicting RNA-binding sites in proteins , 2007, Nucleic Acids Res..

[38]  Z. Weng,et al.  Protein–protein docking benchmark version 3.0 , 2008, Proteins.

[39]  R. Russell,et al.  Targeting and tinkering with interaction networks. , 2008, Nature chemical biology.

[40]  Xue-wen Chen,et al.  Sequence-based prediction of protein interaction sites with an integrative method , 2009, Bioinform..

[41]  Kristian Vlahovicek,et al.  Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests , 2009, PLoS Comput. Biol..

[42]  Alfonso Valencia,et al.  Progress and challenges in predicting protein-protein interaction sites , 2008, Briefings Bioinform..

[43]  Zhiping Weng,et al.  Protein–protein docking benchmark version 4.0 , 2010, Proteins.

[44]  J. Meller,et al.  Computational Methods for Prediction of Protein-Protein Interaction Sites , 2012 .