Protein Molecular Function Prediction

We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5′-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER’s prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
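To make the general idea of propagating sparse annotations over a phylogeny concrete, the following is a minimal sketch. It is not the SIFTER model: it assumes a toy symmetric two-state model in which function changes along any edge with a fixed probability `p_change`, a uniform prior at the root, and an unreconciled binary tree. The names `partial_likelihood`, `posterior`, `p_change`, and the four-leaf tree are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple, Union

# A tree is either a leaf name (str) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]

# Hypothetical function labels, indexed 0 and 1.
FUNCTIONS = ["adenosine deaminase", "AMP deaminase"]


def transition(parent: int, child: int, p_change: float, n: int) -> float:
    """Assumed toy model: function is kept with prob. 1 - p_change on every edge."""
    return 1.0 - p_change if parent == child else p_change / (n - 1)


def partial_likelihood(node: Tree, evidence: Dict[str, int],
                       p_change: float, n: int) -> List[float]:
    """Return L where L[s] = P(annotations below node | node has function s)."""
    if isinstance(node, str):
        obs = evidence.get(node)
        if obs is None:                      # unannotated leaf: no constraint
            return [1.0] * n
        return [1.0 if s == obs else 0.0 for s in range(n)]
    result = [1.0] * n
    for child in node:
        child_l = partial_likelihood(child, evidence, p_change, n)
        for s in range(n):
            # Sum over the child's possible functions (pruning recursion).
            result[s] *= sum(transition(s, c, p_change, n) * child_l[c]
                             for c in range(n))
    return result


def posterior(tree: Tree, evidence: Dict[str, int], target: str,
              p_change: float = 0.1, n: int = 2) -> List[float]:
    """Posterior over the function of an unannotated leaf `target`."""
    scores = []
    for s in range(n):
        clamped = dict(evidence)
        clamped[target] = s                  # hypothesize target has function s
        root_l = partial_likelihood(tree, clamped, p_change, n)
        scores.append(sum(root_l) / n)       # uniform prior over root function
    total = sum(scores)
    return [x / total for x in scores]


# Two annotated proteins (A, C) and two unannotated ones (B, D).
tree: Tree = (("A", "B"), ("C", "D"))
evidence = {"A": 0, "C": 1}
for leaf in ("B", "D"):
    probs = posterior(tree, evidence, leaf)
    print(leaf, {FUNCTIONS[i]: round(p, 3) for i, p in enumerate(probs)})
```

In this toy setting each unannotated leaf leans toward the function of its annotated sibling, with the more distant annotation pulling the posterior back toward uncertainty. A model of the kind the abstract describes would additionally use the reconciled phylogeny and the noisy, sparse annotation evidence, neither of which this sketch attempts to represent.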
