AptRank: an adaptive PageRank model for protein function prediction on bi‐relational graphs

Motivation: Diffusion‐based network models are widely used for protein function prediction using protein network data and have been shown to outperform neighborhood‐based and module‐based methods. Recent studies have shown that integrating the hierarchical structure of the Gene Ontology (GO) data dramatically improves prediction accuracy. However, previous methods usually either used the GO hierarchy to refine the prediction results of multiple classifiers, or flattened the hierarchy into a function‐function similarity kernel. No study has combined the GO hierarchy with the protein network in a single two‐layer network model.

Results: We first construct a bi‐relational graph (Birg) model comprising both protein‐protein association and function‐function hierarchical networks. We then propose two diffusion‐based methods, BirgRank and AptRank, both of which use PageRank to diffuse information on this two‐layer graph model. BirgRank is a direct application of traditional PageRank with fixed decay parameters. In contrast, AptRank uses an adaptive diffusion mechanism to improve on the performance of BirgRank. We evaluate the ability of both methods to predict protein function on yeast, fly and human protein datasets, and compare them with four previous methods: GeneMANIA, TMC, ProteinRank and clusDCA. To comprehensively evaluate the predictive ability of all six methods, we design four validation strategies: missing function prediction, de novo function prediction, guided function prediction and newly discovered function prediction. We find that both BirgRank and AptRank outperform the previous methods, especially in missing function prediction when using only 10% of the data for training.

Availability and Implementation: The MATLAB code is available at https://github.rcac.purdue.edu/mgribsko/aptrank.

Contact: gribskov@purdue.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
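To make the idea of diffusing information over a two‐layer graph concrete, the following is a minimal Python sketch of personalized PageRank with a fixed decay parameter on a toy bi‐relational graph. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the function names `build_birg` and `personalized_pagerank`, the edge directions (protein to function, child term to parent term), the value `alpha = 0.85`, and the handling of nodes without outgoing edges are all choices made here for the example. The adaptive mechanism of AptRank is not shown.

```python
import numpy as np

def build_birg(G, R, H):
    """Stack the two layers into one adjacency matrix (hypothetical layout).

    G: (m x m) protein-protein association weights
    R: (m x n) protein-function annotations (rows are proteins)
    H: (n x n) function hierarchy, H[i, j] = 1 for child i -> parent j
    """
    m, n = R.shape
    A = np.zeros((m + n, m + n))
    A[:m, :m] = G   # edges within the protein layer
    A[:m, m:] = R   # assumed direction: protein -> annotated function
    A[m:, m:] = H   # assumed direction: child term -> parent term
    return A

def personalized_pagerank(A, seed, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with a fixed decay parameter alpha."""
    out = A.sum(axis=1)
    dangling = out == 0                      # nodes with no outgoing edges
    P = np.zeros_like(A, dtype=float)
    P[~dangling] = A[~dangling] / out[~dangling, None]  # row-normalize
    v = seed / seed.sum()                    # restart distribution
    x = v.copy()
    for _ in range(max_iter):
        # Assumption: mass at dangling nodes is sent back to the seed.
        x_new = alpha * (P.T @ x + x[dangling].sum() * v) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy data: 3 proteins on a path, 2 GO-like terms (term 0 is a child of term 1).
G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
R = np.array([[1, 0], [0, 0], [0, 0]], dtype=float)   # protein 0 has term 0
H = np.array([[0, 1], [0, 0]], dtype=float)

A = build_birg(G, R, H)
seed = np.zeros(5)
seed[2] = 1.0                                # query protein 2
scores = personalized_pagerank(A, seed)
function_scores = scores[3:]                 # scores of the two terms
```

In this toy setting the query protein carries no annotation of its own, yet the walk reaches term 0 through the protein layer and then its parent term through the hierarchy layer, which is the behaviour a two‐layer diffusion is meant to capture.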
