A New Feature Vector Based on Gene Ontology Terms for Protein-Protein Interaction Prediction

Protein-protein interaction (PPI) plays a key role in understanding cellular mechanisms in different organisms. Many supervised classifiers, such as Random Forest (RF) and Support Vector Machine (SVM), have been used for intra- or inter-species interaction prediction. To improve prediction performance, in this paper we propose a novel set of features that represents a protein pair using the Gene Ontology (GO) terms annotating the two proteins, including the ancestors of those terms. In our approach, a protein pair is treated as a document (bag of words), where the terms annotating the two proteins represent the words. The feature value of each word is calculated as the information content of the corresponding term multiplied by a coefficient, which represents the weight of that term inside the document (i.e., the protein pair). We have tested the performance of the classifier using the proposed features on well-known data sets from different species, including S. cerevisiae, H. sapiens, E. coli, and D. melanogaster. We compare it with another GO-based feature representation technique and demonstrate its competitive performance.
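The bag-of-words representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the coefficient, so the sketch assumes a simple one (the fraction of the pair's two proteins that carry the term), and it assumes the common definition of information content, IC(t) = -log p(t), with p(t) estimated from annotation frequency. The GO fragment, term IDs, protein names, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy GO fragment: term -> set of direct parent terms.
PARENTS = {
    "GO:A": set(),             # root
    "GO:B": {"GO:A"},
    "GO:C": {"GO:A"},
    "GO:D": {"GO:B"},
    "GO:E": {"GO:B", "GO:C"},
}

def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen

def information_content(annotations):
    """IC(t) = -log p(t), where p(t) is the fraction of proteins
    annotated with t directly or through a descendant (assumed definition)."""
    counts = Counter()
    for terms in annotations.values():
        counts.update(with_ancestors(terms))
    n = len(annotations)
    return {t: math.log(n / c) for t, c in counts.items()}

def pair_features(p1, p2, annotations, ic, vocabulary):
    """Feature vector for a protein pair treated as a bag of GO terms."""
    bag = Counter(with_ancestors(annotations[p1]))
    bag.update(with_ancestors(annotations[p2]))
    # Assumed coefficient: share of the two proteins carrying the term (0, 0.5 or 1).
    return [ic.get(t, 0.0) * bag[t] / 2.0 for t in vocabulary]

# Hypothetical annotations for three proteins.
annotations = {"P1": {"GO:D"}, "P2": {"GO:E"}, "P3": {"GO:C"}}
ic = information_content(annotations)
vocabulary = sorted(PARENTS)
vec = pair_features("P1", "P2", annotations, ic, vocabulary)
```

In this sketch the root term receives a feature value of zero because every protein carries it, while rarer terms shared by both proteins receive the largest values. The resulting vectors could then be passed to a classifier such as RF or SVM.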
