A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.

Results: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.

Conclusion: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
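
The Results report precision at a fixed recall of 20%. As a minimal sketch of how such a figure could be computed for a single GO term, the function below ranks genes by prediction score and returns the precision at the first cutoff where the target recall is reached. The function name, the input layout, and the toy data are assumptions made for illustration; the abstract does not describe the evaluation code actually used, and tied scores are not given any special treatment here.

```python
def precision_at_recall(scores, labels, target_recall=0.2):
    """Precision at the first ranked cutoff whose recall reaches target_recall.

    scores: prediction confidence per gene (higher = more confident).
    labels: 1 if the gene is annotated to the GO term, else 0.
    Both names are hypothetical; this is an illustrative sketch only.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank genes from most to least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        # Recall = fraction of all annotated genes recovered in the top k.
        if true_pos / total_pos >= target_recall:
            # Precision = fraction of the top k that are truly annotated.
            return true_pos / k

    # Only reached if target_recall is greater than 1.
    return true_pos / len(ranked)


# Toy example: eight genes, four of them annotated to the term.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 1]
toy_precision = precision_at_recall(toy_scores, toy_labels, target_recall=0.5)
```

Averaging this quantity over all evaluated GO terms would give a summary comparable in spirit to the 41% figure above, under the assumption that each term is scored independently.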
