A Network-Based Approach for Protein Functions Prediction Using Locally Linear Embedding

Inferring protein functions from different data sources is a challenging task in the post-genomic era, because a large number of crude protein structures from structural genomics projects are now solved without their biochemical functions being characterized. Many methods have recently been used to predict protein function, including those based on protein-protein interaction (PPI) data, structure, sequence relationships, and gene expression data. Among these approaches, methods based on protein interaction data are very promising. In this paper, we study a network-based method that uses locally linear embedding (LLE). LLE is a robust learning algorithm that performs dimensionality reduction by computing a neighborhood-preserving embedding of high-dimensional data. We first embed both annotated and unannotated proteins in a low-dimensional Euclidean space; we then apply semi-supervised learning techniques to classify the unannotated proteins into different functional groups. Finally, we make predictions for yeast proteins of unknown function. We use 5-fold cross-validation on GO terms to compare the performance of different approaches, and the proposed method performs significantly better than the others.
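The embed-then-classify pipeline can be illustrated with a minimal sketch. The abstract does not say how proteins are represented before embedding or which semi-supervised technique is used, so the choices here are assumptions made only for illustration: each protein is described by its row of a toy PPI adjacency matrix, and labels are spread by a simple clamped label propagation in the embedded space. The function names `lle_embed` and `propagate_labels`, the toy network, and all parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np


def lle_embed(X, n_neighbors, n_components, reg=1e-3):
    """Standard locally linear embedding of the rows of X."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances; a point is never its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]

    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                  # neighbors centered on point i
        C = Z @ Z.T                            # local Gram matrix
        # Regularize so the linear system is always well conditioned.
        C += np.eye(n_neighbors) * reg * max(np.trace(C), 1e-12)
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()            # reconstruction weights sum to 1

    I = np.eye(n)
    M = (I - W).T @ (I - W)
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    # Skip the first (near-constant) eigenvector, keep the next n_components.
    return vecs[:, 1:n_components + 1], W


def propagate_labels(Y, labels, n_iter=100):
    """Clamped label propagation; labels == -1 marks unannotated points.

    Assumption: this is a generic stand-in for the semi-supervised step,
    not the specific technique used in the paper.
    """
    classes = np.unique(labels[labels >= 0])
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-d2 / (d2.mean() + 1e-12))      # Gaussian similarity
    np.fill_diagonal(S, 0.0)
    P = S / S.sum(axis=1, keepdims=True)       # row-stochastic transitions

    F0 = (labels[:, None] == classes[None, :]).astype(float)
    F = F0.copy()
    known = labels >= 0
    for _ in range(n_iter):
        F = P @ F
        F[known] = F0[known]                   # keep annotated proteins fixed
    return classes[F.argmax(axis=1)]


# Toy PPI network (hypothetical): two 4-protein cliques joined by edge 3-4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (3, 4)]
A = np.zeros((8, 8))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

# One annotated protein per functional group; -1 means unannotated.
labels = np.array([0, -1, -1, -1, -1, -1, -1, 1])

embedding, W = lle_embed(A, n_neighbors=3, n_components=2)
predicted = propagate_labels(embedding, labels)
```

Here `embedding` holds one 2-dimensional coordinate per protein and `predicted` holds one functional-group label per protein, with the two annotated proteins keeping their given labels. On real data the neighborhood size, embedding dimension, and similarity kernel would all need tuning, for example through the cross-validation described above.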
