A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity

MOTIVATION: Unravelling the rules underlying protein-protein and protein-ligand interactions is a crucial step in understanding cell machinery. Peptide recognition modules (PRMs) are globular protein domains that bind short peptide sequences and play a key role in protein-protein interactions. High-throughput techniques permit scanning of the whole proteome for each domain, but they suffer from a high incidence of false positives. There is therefore a pressing need for in silico experiments to validate experimental results and for computational tools to infer domain-peptide interactions.

RESULTS: We focused on the SH3 domain family and developed a machine-learning approach for inferring interaction specificity. SH3 domains are well-studied PRMs that typically bind short proline-rich sequences characterized by the PxxP consensus. The binding information is known to be held in the conformation of the domain surface and in the short sequence of the peptide. Our method relies on interaction data from high-throughput techniques and benefits from the integration of sequence and structure data of the interacting partners. Here, we propose a novel encoding technique that represents binding information on the basis of the domain-peptide contact residues in complexes of known structure. Remarkably, the new encoding requires few variables to represent an interaction, thus avoiding the 'curse of dimensionality'. Our results show an accuracy >90% in detecting new binders of known SH3 domains, outperforming neural models trained on standard binary encodings, profile methods and recent statistical predictors. Moreover, the method shows a capacity for generalization: it infers the specificity of unknown SH3 domains that display some degree of similarity with the known data.
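As a small illustration of the PxxP consensus mentioned above (proline, any two residues, proline), the following sketch scans a peptide for every occurrence of the motif. It is not part of the method described here: the function name `find_pxxp` and the example peptide are assumptions chosen for illustration, and a motif match alone says nothing about whether a given SH3 domain actually binds the peptide.

```python
import re

# PxxP: proline, any two residues, proline.
# The pattern is wrapped in a lookahead so that overlapping
# occurrences are all reported, not just the first of each cluster.
PXXP = re.compile(r"(?=(P..P))")


def find_pxxp(peptide: str) -> list[tuple[int, str]]:
    """Return (0-based start index, matched 4-mer) for each PxxP occurrence.

    `find_pxxp` is a hypothetical helper for illustration only.
    """
    return [(m.start(), m.group(1)) for m in PXXP.finditer(peptide.upper())]


if __name__ == "__main__":
    # Illustrative proline-rich peptide (an assumption, not data from the paper).
    for start, motif in find_pxxp("APPLPPRNRP"):
        print(start, motif)
```

A scan like this only flags candidate sites; the approach in the abstract instead learns specificity from interaction data and from the contact residues of the domain and the peptide.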
