Active learning for protein function prediction in protein-protein interaction networks

Abstract High-throughput technologies have produced vast amounts of protein–protein interaction (PPI) data, and a number of approaches based on PPI networks have been proposed for protein function prediction. However, these approaches perform poorly when annotated (labeled) proteins are scarce in the network. To address this issue, we propose an active-learning-based approach that uses graph-based centrality metrics to select candidates for labeling. We first cluster a PPI network with the spectral clustering algorithm and, within each cluster, select informative candidates for labeling according to a chosen centrality metric; we then apply a collective classification algorithm to predict protein function from these labeled proteins. Experiments on two real datasets demonstrate that the active-learning-based approach achieves better prediction performance by choosing more informative proteins for labeling. The experimental results also show that betweenness centrality is more effective than degree centrality and closeness centrality in most cases.
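
The select-then-classify pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation. The cluster assignment is supplied by hand instead of being computed by spectral clustering. Betweenness is computed on the whole network with Brandes' algorithm, whereas the paper may compute it within each cluster. A simple neighbor-majority vote stands in for the collective classification step. The function names, the toy graph, and the labels "kinase" and "transport" are hypothetical.

```python
from collections import Counter, deque


def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


def select_candidates(adj, clusters, per_cluster=1):
    """Pick the highest-betweenness proteins in each cluster for labeling."""
    score = betweenness(adj)
    chosen = []
    for members in clusters:
        ranked = sorted(members, key=lambda v: (-score[v], v))
        chosen.extend(ranked[:per_cluster])
    return chosen


def propagate_labels(adj, seed_labels, max_iter=10):
    """Stand-in for collective classification: iterative neighbor majority vote."""
    labels = dict(seed_labels)
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            if v in seed_labels:
                continue  # queried labels stay fixed
            votes = Counter(labels[w] for w in adj[v] if w in labels)
            if votes:
                best = min(votes, key=lambda c: (-votes[c], c))
                if labels.get(v) != best:
                    labels[v] = best
                    changed = True
        if not changed:
            break
    return labels


# Toy PPI network: two triangles joined by the edge C-D (hypothetical data).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clusters = [{"A", "B", "C"}, {"D", "E", "F"}]  # assumed clustering output
queried = select_candidates(adj, clusters)      # proteins sent to the annotator
oracle = {"C": "kinase", "D": "transport"}      # hypothetical annotations
predicted = propagate_labels(adj, {p: oracle[p] for p in queried})
```

On this toy network the two bridge proteins C and D are selected, and their labels spread to the remaining members of each triangle. A degree or closeness score could be swapped in for `betweenness` to compare the metrics the abstract mentions.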
