A graph kernel approach for alignment-free domain–peptide interaction prediction with an application to human SH3 domains

Motivation: State-of-the-art experimental data for determining the binding specificities of peptide recognition modules (PRMs) are obtained by high-throughput approaches such as peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by SH3 domains.

Results: Here, we present a machine-learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are an important class of PRMs. The graph-kernel strategy allows us to (i) integrate several types of physico-chemical information for each amino acid, (ii) consider high-order correlations between these features and (iii) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve a competitive predictive performance of 0.73 area under the precision-recall curve, compared with 0.27 for state-of-the-art methods based on position weight matrices. We show that better models can be obtained when we use information on the noninteracting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high-confidence negative data. The techniques introduced here are general and can therefore also be applied to any other protein domains that interact with short peptides (i.e. other PRMs).

Availability: The program with the predictive models can be found at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/SH3PepInt.tar.gz. We also provide a genome-wide prediction for all 70 human SH3 domains, which is available at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/Genome-Wide-Predictions.tar.gz.

Contact: backofen@informatik.uni-freiburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
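The performance figures quoted in the Results are areas under the precision-recall curve. As a minimal sketch of how such a value can be obtained from ranked predictions, the following Python function computes average precision, one common step-wise estimate of that area. The abstract does not state which estimator was used, so this is an illustration under that assumption, not the procedure of the paper; the function name and the toy scores and labels are hypothetical.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    scores: predicted binding scores, higher means more likely to bind.
    labels: 1 for an interacting peptide, 0 for a noninteracting one.
    Ties in the scores are not treated specially (an assumption of this
    sketch); tied items keep their input order.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")

    # Rank peptides from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this binder is recovered.
            total += hits / rank
    return total / n_pos


# Hypothetical toy data: four peptides scored against one domain.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_ap = average_precision(toy_scores, toy_labels)
```

Because precision depends on the number of negatives ranked above each binder, this measure is sensitive to how the noninteracting peptides are chosen, which is why the selection of negative data matters for the comparison reported above.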
