Predicting Protein-Protein Interactions from Primary Protein Sequences Using a Novel Multi-Scale Local Feature Representation Scheme and the Random Forest

The study of protein-protein interactions (PPIs) can be very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratories are both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at the precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based method. Experimental results show that the performance of our predictor is better than several other state-of-the-art predictors also with the H. pylori dataset. The reason why such good results are achieved can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experiment results show that the proposed approach can be very promising for predicting PPIs and can be a useful tool for future proteomic studies.

[1]  De-Shuang Huang,et al.  Normalized Feature Vectors: A Novel Alignment-Free Sequence Comparison Method Based on the Numbers of Adjacent Amino Acids , 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[2]  Yun Gao,et al.  Prediction of Protein-Protein Interactions Using Local Description of Amino Acid Sequence , 2011 .

[3]  Hongbin Shen,et al.  BinTree Seeking: A Novel Approach to Mine Both Bi-Sparse and Cohesive Modules in Protein Interaction Networks , 2011, PloS one.

[4]  K. Aihara,et al.  A discriminative approach for identifying domain–domain interactions from protein–protein interactions , 2010, Proteins.

[5]  E. Kristiansson,et al.  A software pipeline for processing and identification of fungal ITS sequences , 2009, Source Code for Biology and Medicine.

[6]  F. Fraternali,et al.  Large-Scale Modelling of the Divergent Spectrin Repeats in Nesprins: Giant Modular Proteins , 2013, PloS one.

[7]  I. Muchnik,et al.  Prediction of protein folding class using global description of amino acid sequence. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Yanjun Qi,et al.  Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources , 2004, Pacific Symposium on Biocomputing.

[9]  Alessandro Pandini,et al.  Detection of Allosteric Signal Transmission by Information-Theoretic Analysis of Protein Dynamics , 2012 .

[10]  Xingming Zhao,et al.  Predicting protein–protein interactions from protein sequences using meta predictor , 2010, Amino Acids.

[11]  Ioannis Xenarios,et al.  DIP: The Database of Interacting Proteins: 2001 update , 2001, Nucleic Acids Res..

[12]  De-Shuang Huang,et al.  Graphical Representation for DNA Sequences via Joint Diagonalization of Matrix Pencil , 2013, IEEE Journal of Biomedical and Health Informatics.

[13]  angesichts der Corona-Pandemie,et al.  UPDATE , 1973, The Lancet.

[14]  Hareton K. N. Leung,et al.  A Highly Efficient Approach to Protein Interactome Mapping Based on Collaborative Filtering Framework , 2015, Scientific Reports.

[15]  Noémie Elhadad,et al.  Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies , 2013, BMC Bioinformatics.

[16]  Yanzhi Guo,et al.  Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences , 2008, Nucleic acids research.

[17]  Benjamin A. Shoemaker,et al.  Deciphering Protein–Protein Interactions. Part II. Computational Methods to Predict Protein and Domain Interaction Partners , 2007, PLoS Comput. Biol..

[18]  Loris Nanni,et al.  Hyperplanes for predicting protein-protein interactions , 2005, Neurocomputing.

[19]  Gary D Bader,et al.  Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry , 2002, Nature.

[20]  Feiping Nie,et al.  Predicting Protein-Protein Interactions from Multimodal Biological Data Sources via Nonnegative Matrix Tri-Factorization , 2012, RECOMB.

[21]  Ashkan Golshani,et al.  Short Co-occurring Polypeptide Regions Can Predict Global Protein Interaction Maps , 2012, Scientific Reports.

[22]  Christoph Steinbeck,et al.  Self-organizing ontology of biochemically relevant small molecules , 2011, BMC Bioinformatics.

[23]  Zhu-Hong You,et al.  Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis , 2013, BMC Bioinformatics.

[24]  Suyu Mei,et al.  AdaBoost Based Multi-Instance Transfer Learning for Predicting Proteome-Wide Interactions between Salmonella and Human Proteins , 2014, PloS one.

[25]  Keith C. C. Chan,et al.  Discovering Functional Interdependence Relationship in PPI Networks for Protein Complex Identification , 2012, IEEE Transactions on Biomedical Engineering.

[26]  Xiaobo Zhou,et al.  A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network , 2010, BMC Bioinformatics.

[27]  Kyungsook Han,et al.  Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. , 2010, Protein and peptide letters.

[28]  Adam J. Smith,et al.  The Database of Interacting Proteins: 2004 update , 2004, Nucleic Acids Res..

[29]  Jie Gui,et al.  Prediction of protein-protein interactions from protein sequence using local descriptors. , 2010, Protein and peptide letters.

[30]  Dmitrij Frishman,et al.  The Negatome database: a reference set of non-interacting protein pairs , 2009, Nucleic Acids Res..

[31]  P. Bork,et al.  Functional organization of the yeast proteome by systematic analysis of protein complexes , 2002, Nature.

[32]  Hong-Bin Shen,et al.  Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. , 2011, Journal of theoretical biology.

[33]  Xin Li,et al.  Protein classification with imbalanced data , 2007, Proteins.

[34]  Zhu-Hong You,et al.  Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data , 2010, Bioinform..

[35]  Zhen Ji,et al.  Assessing and predicting protein interactions by combining manifold embedding with multiple information integration , 2012, BMC Bioinformatics.

[36]  Feiping Nie,et al.  Predicting Protein-Protein Interactions from Multimodal Biological Data Sources via Nonnegative Matrix Tri-Factorization , 2012, RECOMB.

[37]  Xing-Ming Zhao,et al.  A novel approach to extracting features from motif content and protein composition for protein sequence classification , 2005, Neural Networks.

[38]  William Stafford Noble,et al.  Choosing negative examples for the prediction of protein-protein interactions , 2006, BMC Bioinformatics.

[39]  Honghua Tan,et al.  Advances in Computer Science and Education Applications , 2011 .

[40]  Raquel Norel,et al.  Protein interface conservation across structure space , 2010, Proceedings of the National Academy of Sciences.

[41]  Ying Xu,et al.  Large-Scale Analyses of Glycosylation in Cellulases , 2009, Genom. Proteom. Bioinform..

[42]  A. Fornili,et al.  Protein–protein interaction networks studies and importance of 3D structure knowledge , 2013, Expert review of proteomics.

[43]  Tamás Korcsmáros,et al.  ComPPI: a cellular compartment-specific database for protein–protein interaction network analysis , 2014, Nucleic Acids Res..

[44]  B. Honig,et al.  Structure-based prediction of protein-protein interactions on a genome-wide scale , 2012, Nature.

[45]  R. Ozawa,et al.  A comprehensive two-hybrid analysis to explore the yeast protein interactome , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[46]  David A. Gough,et al.  Whole-proteome interaction mining , 2003, Bioinform..

[47]  De-Shuang Huang,et al.  Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset , 2010, Amino Acids.

[48]  Hongbin Shen,et al.  Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features. , 2010, Journal of proteome research.

[49]  Franca Fraternali,et al.  Protein Networks Reveal Detection Bias and Species Consistency When Analysed by Information-Theoretic Methods , 2010, PloS one.

[50]  Loris Nanni,et al.  An ensemble of K-local hyperplanes for predicting protein-protein interactions , 2006, Bioinform..

[51]  Huiru Zheng,et al.  GRIP: A web-based system for constructing Gold Standard datasets for protein-protein interaction prediction , 2008, Source Code for Biology and Medicine.

[52]  M. Vidal,et al.  Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". , 2001, Genome research.

[53]  Sean R. Collins,et al.  Global landscape of protein complexes in the yeast Saccharomyces cerevisiae , 2006, Nature.

[54]  Juwen Shen,et al.  Predicting protein–protein interactions based only on sequences information , 2007, Proceedings of the National Academy of Sciences.

[55]  Purvesh Khatri,et al.  Onto-Tools: an ensemble of web-accessible, ontology-based tools for the functional design and interpretation of high-throughput gene expression experiments , 2004, Nucleic Acids Res..

[56]  Jean-Loup Faulon,et al.  Predicting protein-protein interactions using signature products , 2005, Bioinform..