Exploiting ontology graph for predicting sparsely annotated gene function

Motivation: Systematically predicting gene (or protein) function from molecular interaction networks has become an important tool for refining and enhancing existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge: any prediction algorithm that considers each label independently faces a paucity of information and is therefore prone to capturing non-generalizable patterns in the data, resulting in poor predictive performance. A variety of function prediction algorithms exist, but none properly addresses this ‘overfitting’ issue for sparsely annotated functions, or does so in a manner scalable to the tens of thousands of functions in the human catalog.

Results: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing performance on labels with sufficient information. Furthermore, we show that our method can accurately predict the genes that will be assigned to a functional label with no known annotations, based only on the ontology graph structure and the genes associated with other labels. This further suggests that our method effectively utilizes the similarity between gene functions.

Availability and implementation: https://github.com/wangshenguiuc/clusDCA.

Contact: jianpeng@illinois.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
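The idea of transferring information between similar functional labels can be illustrated with a toy sketch. The code below is not clusDCA itself, whose details are not given in this abstract. It swaps in a plain random walk with restart on a small, made-up ontology graph and uses the resulting label-to-label affinities to smooth a label-by-gene annotation matrix, so that a label with no known annotations still receives gene scores from its ontology neighbors. The function name `label_affinity`, the restart probability, the graph and the annotations are all assumptions chosen for illustration.

```python
import numpy as np


def label_affinity(adjacency, restart=0.5):
    """Random walk with restart on a label graph (illustrative only).

    Row i of the result is the stationary distribution of a walk that
    restarts at label i with probability `restart`.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0           # guard against isolated nodes
    transition = adjacency / col_sums       # column-stochastic matrix
    n = adjacency.shape[0]
    # Closed form of s = (1 - r) * P @ s + r * e_i, solved for every i at once.
    diffusion = restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * transition)
    return diffusion.T


# Hypothetical ontology with 4 labels: label 0 is the root, label 1 is its
# child, and labels 2 and 3 are children of label 1 (edges are undirected).
ontology = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
])

# Hypothetical label-by-gene annotations (5 genes). Label 3 has none.
annotations = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

affinity = label_affinity(ontology)
# Each label's gene scores become an affinity-weighted mix of all labels' annotations.
smoothed = affinity @ annotations

# Genes ranked for the unannotated label 3, best first.
ranking_for_label_3 = np.argsort(-smoothed[3], kind="stable")
```

In this toy setting the unannotated label borrows evidence from its sibling and ancestors, so genes annotated to nearby labels score higher than genes annotated only to the root. A real predictor would also draw on the molecular interaction network, which this sketch leaves out.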
