Diffusion kernel-based logistic regression models for protein function prediction.

Assigning functions to unknown proteins is one of the most important problems in proteomics. Several approaches have used protein-protein interaction data to predict protein functions. We previously developed a Markov random field (MRF) based method to infer a protein's functions using protein-protein interaction data and the functional annotations of its protein interaction partners. In the original model, only direct interactions were considered and each function was considered separately. In this study, we develop a new model which extends direct interactions to all neighboring proteins, and one function to multiple functions. The goal is to understand a protein's function based on information on all the neighboring proteins in the interaction network. We first developed a novel kernel logistic regression (KLR) method based on diffusion kernels for protein interaction networks. The diffusion kernels provide means to incorporate all neighbors of proteins in the network. Second, we identified a set of functions that are highly correlated with the function of interest, referred to as the correlated functions, using the chi-square test. Third, the correlated functions were incorporated into our new KLR model. Fourth, we extended our model by incorporating multiple biological data sources such as protein domains, protein complexes, and gene expressions by converting them into networks. We showed that the KLR approach of incorporating all protein neighbors significantly improved the accuracy of protein function predictions over the MRF model. The incorporation of multiple data sets also improved prediction accuracy. The prediction accuracy is comparable to another protein function classifier based on the support vector machine (SVM), using a diffusion kernel. The advantages of the KLR model include its simplicity as well as its ability to explore the contribution of neighbors to the functions of proteins of interest.

[1]  Ting Chen,et al.  Mapping gene ontology to proteins based on protein-protein interaction data , 2004, Bioinform..

[2]  Jason Weston,et al.  Gene functional classification from heterogeneous data , 2001, RECOMB.

[3]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Nello Cristianini,et al.  Kernel-Based Data Fusion and Its Application to Protein Function Prediction in Yeast , 2003, Pacific Symposium on Biocomputing.

[5]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Amanda Clare,et al.  The utility of different representations of protein sequence for predicting functional class , 2001, Bioinform..

[7]  T. Takagi,et al.  Assessment of prediction accuracy of protein function from protein–protein interaction data , 2001, Yeast.

[8]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[9]  Hans-Werner Mewes,et al.  Integrative Analysis of Protein Interaction Data , 2000, ISMB.

[10]  R. King,et al.  On the optimization of classes for the assignment of unidentified reading frames in functional genomics programmes: the need for machine learning. , 2000, Trends in biotechnology.

[11]  Gary D Bader,et al.  Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry , 2002, Nature.

[12]  Alessandro Vespignani,et al.  Global protein function prediction from protein-protein interaction networks , 2003, Nature Biotechnology.

[13]  C. A. Andersen,et al.  Prediction of human protein function from post-translational modifications and localization features. , 2002, Journal of molecular biology.

[14]  M. Gerstein,et al.  A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data , 2003, Science.

[15]  M. Gerstein,et al.  A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. , 2000, Journal of molecular biology.

[16]  B. Snel,et al.  Comparative assessment of large-scale data sets of protein–protein interactions , 2002, Nature.

[17]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[18]  Amanda Clare,et al.  Machine learning of functional class from phenotype data , 2002, Bioinform..

[19]  B. Schwikowski,et al.  A network of protein–protein interactions in yeast , 2000, Nature Biotechnology.

[20]  Kui Zhang,et al.  Prediction of protein function using protein-protein interaction data , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[21]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[22]  William Stafford Noble,et al.  Kernel methods for predicting protein-protein interactions , 2005, ISMB.

[23]  M. Gerstein,et al.  The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. , 1999, Journal of molecular biology.

[24]  Yael Mandel-Gutfreund,et al.  Progress in Predicting Protein Function from Structure: Unique Features of O-Glycosidases , 2002, Pacific Symposium on Biocomputing.

[25]  E. O’Shea,et al.  Global analysis of protein localization in budding yeast , 2003, Nature.

[26]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[27]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[28]  Søren Brunak,et al.  Prediction of Glycosylation Across the Human Proteome and the Correlation to Protein Function , 2001, Pacific Symposium on Biocomputing.