Incorporating functional inter-relationships into protein function prediction algorithms

Background: Functional classification schemes (e.g. the Gene Ontology) that serve as the basis for annotation efforts in several organisms are often the source of gold standard information for computational efforts at supervised protein function prediction. While successful function prediction algorithms have been developed, few previous efforts have used more than the protein-to-functional-class label information provided by such knowledge bases. For instance, the Gene Ontology not only captures protein annotations to a set of functional classes, but also arranges these classes in a DAG-based hierarchy that captures rich inter-relationships between different classes. For standard classification-based approaches, these inter-relationships present both opportunities, such as the potential for additional training examples for small classes from larger related classes, and challenges, such as a harder-to-learn distinction between similar GO terms.

Results: We propose a method to enhance the performance of classification-based protein function prediction algorithms by exploiting the inter-relationships between the functional classes that constitute functional classification schemes. Using a standard measure for evaluating the semantic similarity between nodes in an ontology, we quantify these inter-relationships and incorporate them into the k-nearest neighbor classifier. We present experiments on several large genomic data sets, each of which is used for the modeling and prediction of over a hundred classes from the GO Biological Process ontology. The results show that this incorporation produces more accurate predictions for a large number of the functional classes considered, and that the classes that benefit most from this approach are those containing the fewest members. In addition, we show how our proposed framework can be used to integrate information from the entire GO hierarchy to improve the accuracy of predictions made over a set of base classes. Finally, we provide qualitative and quantitative evidence that this incorporation of functional inter-relationships enables the discovery of interesting biology in the form of novel functional annotations for several yeast proteins, such as Sna4, Rtn1 and Lin1.

Conclusion: We implemented and evaluated a methodology for incorporating inter-relationships between functional classes into a standard classification-based protein function prediction algorithm. Our results show that this incorporation can help improve the accuracy of such algorithms and help uncover novel biology in the form of previously unknown functional annotations. The complete source code, a sample data set and the additional files for this paper are available free of charge for non-commercial use at http://www.cs.umn.edu/vk/gaurav/functionalsimilarity/.
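
The idea stated in the Results, quantifying class inter-relationships with a semantic similarity measure and feeding them into a k-nearest neighbor classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy ontology (`PARENTS`), the term probabilities (`PROB`), the choice of a Lin-style information-theoretic similarity, the cosine feature similarity and the scoring rule in `knn_score` are all assumptions made for illustration only.

```python
import math

# Hypothetical toy ontology (child -> set of parents), standing in for a GO-like DAG.
PARENTS = {
    "root": set(),
    "transport": {"root"},
    "ion_transport": {"transport"},
    "protein_transport": {"transport"},
    "metabolism": {"root"},
}

# Hypothetical fraction of proteins annotated to each term or any of its descendants.
PROB = {
    "root": 1.0,
    "transport": 0.4,
    "ion_transport": 0.1,
    "protein_transport": 0.2,
    "metabolism": 0.5,
}


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Rarer terms carry more information; the root carries none."""
    return math.log(1.0 / PROB[term])


def lin_similarity(a, b):
    """Lin-style similarity based on the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    best = max(information_content(t) for t in common)
    denom = information_content(a) + information_content(b)
    return 2.0 * best / denom if denom > 0 else 1.0


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def knn_score(query, train, target, k=3):
    """Score `query` for class `target` (assumed scoring rule).

    Each of the k nearest neighbors contributes its feature similarity,
    weighted by how semantically close its best label is to `target`,
    so neighbors from related classes also count, but less.
    """
    nearest = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    score = 0.0
    for vector, labels in nearest:
        weight = max((lin_similarity(target, lab) for lab in labels), default=0.0)
        score += cosine(query, vector) * weight
    return score


# Hypothetical training data: (feature vector, set of class labels).
train = [
    ([1.0, 0.0], {"protein_transport"}),
    ([0.0, 1.0], {"metabolism"}),
]
related = knn_score([1.0, 0.1], train, "ion_transport", k=1)
unrelated = knn_score([1.0, 0.1], train, "metabolism", k=1)
print(related, unrelated)
```

In this toy setting, the nearest neighbor is annotated only with `protein_transport`, yet it still contributes to the `ion_transport` score because the two classes share an informative ancestor. A plain k-nearest neighbor vote would give that neighbor no weight for `ion_transport`, which is the kind of additional evidence for small classes that the abstract describes.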
