Active learning for protein function prediction in protein-protein interaction networks

High-throughput technologies have produced vast amounts of protein-protein interaction (PPI) data, and a number of approaches based on PPI networks have been proposed for protein function prediction. However, these approaches perform poorly when annotated proteins are scarce in the network. To address this issue, we propose an active-learning-based approach that uses graph-based centrality metrics to select suitable candidates for labeling. We first cluster a PPI network with the spectral clustering algorithm and select candidates for labeling within each cluster; we then apply a collective classification algorithm to predict protein function from these annotated proteins. Experiments on two real datasets demonstrate that the active-learning-based approach achieves better prediction performance by choosing more informative proteins for labeling. The results also show that betweenness centrality is more effective than degree centrality and closeness centrality in most cases.
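The selection and prediction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the clusters have already been computed (the spectral clustering step is omitted), it ranks proteins by betweenness centrality on the subgraph induced by each cluster (the abstract does not say whether centrality is measured per cluster or on the whole network), and it replaces the collective classification algorithm with a simple iterative neighbor-majority vote. The toy network, the function labels `F1` and `F2`, and all function names are hypothetical.

```python
from collections import Counter, deque


def betweenness(adj):
    """Unnormalized betweenness (Brandes' algorithm), unweighted undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2 for v, c in score.items()}  # each pair was counted twice


def select_candidates(adj, clusters, per_cluster=1):
    """Pick the most central proteins of each cluster as labeling candidates."""
    chosen = []
    for members in clusters:
        inside = set(members)
        # Assumption: centrality is computed on the cluster's induced subgraph.
        sub = {v: [w for w in adj[v] if w in inside] for v in members}
        bc = betweenness(sub)
        ranked = sorted(members, key=lambda v: (-bc[v], v))
        chosen.extend(ranked[:per_cluster])
    return chosen


def classify(adj, labels, rounds=10):
    """Stand-in for collective classification: iterative neighbor majority vote."""
    pred = dict(labels)
    for _ in range(rounds):
        updated = dict(pred)
        for v in adj:
            if v in labels:  # queried annotations stay fixed
                continue
            votes = Counter(pred[w] for w in adj[v] if w in pred)
            if votes:
                updated[v] = min(votes, key=lambda lab: (-votes[lab], lab))
        if updated == pred:
            break
        pred = updated
    return pred


# Hypothetical toy PPI network with two precomputed clusters.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("a", "d"), ("d", "e"),
         ("e", "f"), ("f", "g"), ("f", "h"), ("e", "g")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)
clusters = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]

queried = select_candidates(adj, clusters)       # proteins sent for annotation
oracle = {"b": "F1", "f": "F2"}                  # hypothetical annotations
pred = classify(adj, {p: oracle[p] for p in queried})
print(queried, pred)
```

In this toy network the hub of each cluster is queried, and the two annotations are enough for the vote to label every remaining protein. Swapping `betweenness` for a degree or closeness score would give the other two selection strategies the abstract compares.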
