New directions for diffusion-based network prediction of protein function: incorporating pathways with confidence

Motivation: It has long been hypothesized that incorporating models of network noise, edge directions and known pathway information into the representation of protein–protein interaction (PPI) networks might improve their utility for functional inference. However, a simple way to do this has not been obvious. We find that diffusion state distance (DSD), our recent diffusion-based metric for measuring dissimilarity in PPI networks, has natural extensions that incorporate confidence and edge directions, and can even express coherent pathways, by calculating DSD on an augmented graph.

Results: We define three incremental versions of DSD, which we term cDSD, caDSD and capDSD, where the capDSD matrix incorporates confidence, known directed edges and pathways into the measure of how similar each pair of nodes is according to the structure of the PPI network. We test four popular function prediction methods (majority vote, weighted majority vote, multi-way cut and functional flow) using these different matrices on the baker's yeast PPI network in cross-validation. The best performing method is weighted majority vote using capDSD. We then test the performance of our augmented DSD methods on an integrated heterogeneous set of protein association edges from the STRING database. The superior performance of capDSD in this context confirms that treating the pathways as probabilistic units is more powerful than simply incorporating pathway edges independently into the network.

Availability: All source code for calculating the confidences, for extracting pathway information from KEGG XML files, and for calculating the cDSD, caDSD and capDSD matrices is available from http://dsd.cs.tufts.edu/capdsd

Contact: lenore.cowen@tufts.edu or benjamin.hescott@tufts.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
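As an illustration of the kind of matrix the Results refer to, the following is a minimal sketch of a finite-step diffusion state distance on a confidence-weighted graph. It is not the authors' released code. It assumes that the DSD between two nodes is the L1 distance between their vectors of expected visit counts under a k-step random walk (counting the starting step), and that confidence enters only as an edge weight in the transition probabilities. The function name `dsd_matrix`, the parameter `k` and the toy weights are hypothetical; edge directions and pathway augmentation are not modeled here.

```python
import numpy as np


def dsd_matrix(weights, k=5):
    """Sketch of a k-step diffusion state distance (assumed definition).

    weights: square, non-negative matrix of edge confidences (0 = no edge).
    k: number of random-walk steps (hypothetical default).
    Returns an n x n matrix of pairwise L1 distances between the rows of
    the expected-visit-count matrix.
    """
    w = np.asarray(weights, dtype=float)
    degree = w.sum(axis=1)
    if np.any(degree == 0):
        raise ValueError("every node needs at least one weighted edge")

    # Row-normalize so each row is a transition probability distribution.
    p = w / degree[:, None]

    n = w.shape[0]
    visits = np.zeros((n, n))  # visits[u, v]: expected visits to v from u
    step = np.eye(n)           # distribution of the walk after 0 steps
    for _ in range(k + 1):     # steps 0, 1, ..., k
        visits += step
        step = step @ p

    # Pairwise L1 distance between visit-count rows, via broadcasting.
    return np.abs(visits[:, None, :] - visits[None, :, :]).sum(axis=2)


# Toy path graph 0 - 1 - 2 with made-up confidence weights.
toy_weights = np.array([
    [0.0, 0.9, 0.0],
    [0.9, 0.0, 0.9],
    [0.0, 0.9, 0.0],
])
toy_dsd = dsd_matrix(toy_weights, k=5)
```

A function prediction method such as weighted majority vote could then weight each neighbor's annotation by some decreasing function of the corresponding distance; the exact weighting used in the paper is not given in the abstract.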
