Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set

Background: Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Computational prediction of protein interactions is therefore becoming increasingly popular: it offers an inexpensive way to predict the most likely set of interactions at the scale of the entire proteome, and it can complement experimental approaches. Although much progress has been made in this direction, the problem is far from solved, and new approaches are still required to overcome the limitations of current prediction models.

Results: In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation with a Support Vector Machine (SVM). The MCD representation takes into account the interactions between sequentially distant but spatially close amino acid residues, so it can capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. The mRMR (minimum redundancy maximum relevance) feature selection method is used to construct an optimized, more discriminative feature set by excluding redundant features. Finally, a prediction model based on the SVM algorithm is trained and tested to predict the interaction probability of protein pairs.

Conclusions: When applied to the yeast PPI data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at a sensitivity of 90.67%. Extensive experiments were conducted to compare our method with existing sequence-based methods. The experimental results show that our predictor outperforms several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. These results show that the proposed approach is very promising for predicting PPIs, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.
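
The idea of describing a sequence through continuous and discontinuous regions at several scales can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is in the archive linked above), and several details are assumptions introduced only for illustration: the split into four equal-length segments, the seven-class grouping of amino acids, the use of a simple composition descriptor, and every function name (`region_masks`, `split_segments`, `composition`, `mcd_like_features`, `pair_features`).

```python
from itertools import product

# Assumed 7-class grouping of the 20 amino acids (illustrative only;
# the abstract does not specify a grouping).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

N_SEGMENTS = 4  # assumed number of equal-length segments per sequence


def region_masks(n=N_SEGMENTS):
    """Return (continuous, discontinuous) segment masks such as (1, 1, 0, 0)."""
    continuous, discontinuous = [], []
    for mask in product((0, 1), repeat=n):
        ones = [i for i, bit in enumerate(mask) if bit]
        if not ones:
            continue  # skip the empty region
        # A mask is continuous when its selected segments form one unbroken run.
        is_run = ones[-1] - ones[0] + 1 == len(ones)
        (continuous if is_run else discontinuous).append(mask)
    return continuous, discontinuous


def split_segments(seq, n=N_SEGMENTS):
    """Split seq into n consecutive segments of nearly equal length."""
    bounds = [round(i * len(seq) / n) for i in range(n + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n)]


def composition(region):
    """Fraction of residues in each class; unknown letters are ignored."""
    counts = [0] * len(GROUPS)
    for aa in region:
        if aa in CLASS_OF:
            counts[CLASS_OF[aa]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def mcd_like_features(seq):
    """Concatenate the composition descriptor of every region of one protein."""
    segments = split_segments(seq.upper())
    continuous, discontinuous = region_masks()
    features = []
    for mask in continuous + discontinuous:
        # Join the selected segments, skipping the unselected ones in between.
        region = "".join(s for s, bit in zip(segments, mask) if bit)
        features.extend(composition(region))
    return features


def pair_features(seq_a, seq_b):
    """Represent a protein pair by concatenating both feature vectors."""
    return mcd_like_features(seq_a) + mcd_like_features(seq_b)
```

Under these assumptions each protein yields 15 regions (10 continuous, 5 discontinuous) of 7 values each, so a protein pair is described by 210 numbers. In the approach described above, vectors of this kind would then pass through feature selection (mRMR) and into an SVM classifier; neither step is shown here.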
