Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences

Compared with the number of available protein sequences from different organisms, the number of known protein–protein interactions (PPIs) is still very limited. Many computational methods have therefore been developed to facilitate the identification of novel PPIs. Methods that use only protein sequence information are more universal than those that depend on additional information or predictions about the proteins. In this article, a sequence-based method is proposed that combines a new feature representation based on auto covariance (AC) with a support vector machine (SVM). AC accounts for the interactions between residues a certain distance apart in the sequence, so the method takes neighbouring effects into account. When applied to the PPI data of the yeast Saccharomyces cerevisiae, the method achieved a very promising prediction result. An independent data set of 11 474 yeast PPIs was used to evaluate the prediction model, and the prediction accuracy was 88.09%. The performance of this method is superior to that of existing sequence-based methods, so it can be a useful supplementary tool for future proteomics studies. The prediction software and all data sets used in this article are freely available at http://www.scucic.cn/Predict_PPI/index.htm.
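
The abstract does not spell out how the AC features are computed, so the following is only a minimal sketch under stated assumptions. It uses a common definition of auto covariance: for a per-residue descriptor with mean m over a sequence of length n, the value at lag g is the average of (x_i - m)(x_{i+g} - m) over i = 1..n-g. The function names, the `TOY_SCALE` descriptor values, and the choice to concatenate the two proteins' vectors are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical per-residue descriptor, for illustration only.
# A real application would use published physicochemical scales.
TOY_SCALE = {"A": 1.0, "G": 0.0, "L": 2.0}


def auto_covariance(values, max_lag):
    """Return the auto covariance of one descriptor for lags 1..max_lag."""
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        # Products of centered values for residues `lag` positions apart.
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


def pair_features(seq_a, seq_b, scale, max_lag):
    """Concatenate the AC vectors of two proteins into one pair vector.

    Assumption: a protein pair is represented by joining the two
    fixed-length AC vectors; the result could be passed to any classifier.
    """
    feats = []
    for seq in (seq_a, seq_b):
        values = [scale[res] for res in seq]
        feats.extend(auto_covariance(values, max_lag))
    return feats


if __name__ == "__main__":
    print(pair_features("AGLAG", "LLGAA", TOY_SCALE, max_lag=2))
```

Because the number of lags is fixed, proteins of different lengths map to vectors of the same length, which is what a classifier such as an SVM needs as input.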
