Hierarchical multi-label prediction of gene function

Motivation: Assigning functions to unknown genes based on diverse large-scale data is a key task in functional genomics. Previous work on gene function prediction has addressed this problem using independent classifiers for each function. However, such an approach ignores the structure of functional class taxonomies, such as the Gene Ontology (GO). Over a hierarchy of functional classes, a group of independent classifiers, each predicting gene membership in a particular class, can produce a hierarchically inconsistent set of predictions: for a given gene, a specific class may be predicted positive while its inclusive parent class is predicted negative. Taking the hierarchical structure into account resolves such inconsistencies and provides an opportunity to leverage all classifiers in the hierarchy to achieve higher specificity of predictions.

Results: We developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints. Using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types, we combined predictions in our Bayesian framework to obtain the most probable consistent set of predictions. Experiments show that over a 105-node subhierarchy of the GO, our Bayesian framework improves predictions for 93 nodes. As an additional benefit, our method also provides implicit calibration of SVM margin outputs to probabilities. Using this method, we make function predictions for multiple proteins and experimentally confirm predictions for proteins involved in mitosis.

Supplementary information: Results for the 105 selected GO classes and predictions for 1059 unknown genes are available at: http://function.princeton.edu/genesite/

Contact: ogt@cs.princeton.edu
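The notion of a hierarchically inconsistent prediction set, and of choosing the most probable consistent one, can be illustrated with a minimal sketch. This is not the Bayesian framework of the paper: the four-class hierarchy, the per-class probabilities, and the function names are all hypothetical, and each labeling is scored as a simple product of independent per-class probabilities rather than through a model of SVM outputs. Exhaustive enumeration is used only because the toy hierarchy is tiny.

```python
from itertools import product

# Hypothetical toy hierarchy (not from the paper): child -> parent, None marks the root.
PARENT = {"A": None, "B": "A", "C": "A", "D": "B"}


def is_consistent(labels):
    """Return True if no class is labeled positive while its parent is negative."""
    return all(
        not labels[node] or parent is None or labels[parent]
        for node, parent in PARENT.items()
    )


def most_probable_consistent(probs):
    """Brute-force search over all labelings, keeping only consistent ones.

    Assumption: each labeling is scored by the product of independent
    per-class probabilities, a simplification chosen for illustration.
    """
    nodes = list(PARENT)
    best, best_score = None, -1.0
    for bits in product([False, True], repeat=len(nodes)):
        labels = dict(zip(nodes, bits))
        if not is_consistent(labels):
            continue
        score = 1.0
        for n in nodes:
            score *= probs[n] if labels[n] else 1.0 - probs[n]
        if score > best_score:
            best, best_score = labels, score
    return best


# Hypothetical per-class probabilities from independent classifiers.
probs = {"A": 0.4, "B": 0.9, "C": 0.2, "D": 0.8}

# Thresholding each class on its own predicts B and D but not their ancestor A.
thresholded = {n: p >= 0.5 for n, p in probs.items()}
print(is_consistent(thresholded))

# The most probable consistent labeling under the simplified scoring.
print(most_probable_consistent(probs))
```

In this toy case the strong evidence for the descendant classes B and D outweighs the weak evidence against their ancestor A, so the consistent labeling turns A positive instead of discarding B and D. Enumeration grows exponentially with the number of classes, so a hierarchy of the size described above would need a proper inference procedure.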
