Identifying metabolic enzymes with multiple types of association evidence

Background: Existing large-scale metabolic models of sequenced organisms commonly include enzymatic functions that cannot be attributed to any gene in that organism. Existing computational strategies for identifying such missing genes rely primarily on sequence homology to known enzyme-encoding genes.

Results: We present a novel method for identifying genes encoding a specific metabolic function, based on the local structure of the metabolic network and on multiple types of functional association evidence, including clustering of genes on the chromosome, similarity of phylogenetic profiles, gene expression, protein fusion events, and others. Using the E. coli and S. cerevisiae metabolic networks, we illustrate the predictive ability of each individual type of association evidence and show that significantly better predictions can be obtained by combining all of the data. In this way, our method ranks the correct gene within the top 10 (out of 3551) candidates for its enzymatic function for 60% of the enzyme-encoding genes of E. coli metabolism, and as the top candidate in 43% of cases.

Conclusion: We illustrate that a combination of genome context and other functional association evidence is effective in predicting genes encoding metabolic enzymes. Our approach does not rely on direct sequence homology to known enzyme-encoding genes, and it can be used in conjunction with traditional homology-based metabolic reconstruction methods. The method can also be used to target orphan metabolic activities.
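To make the ranking idea concrete, the following is a minimal sketch rather than the authors' actual procedure. It assumes each evidence type provides a pairwise association score in [0, 1] and combines them with a plain unweighted mean, which stands in for whatever combination scheme the paper uses. All gene names, scores, and the functions `rank_candidates` and `top_k_fraction` are hypothetical and exist only for illustration.

```python
from statistics import mean


def rank_candidates(neighbor_genes, candidates, evidence):
    """Rank candidate genes for an unassigned enzymatic function.

    neighbor_genes: genes already assigned to reactions adjacent to the
        target reaction in the metabolic network.
    evidence: dict mapping an evidence type to a dict of
        (gene_a, gene_b) -> association score in [0, 1].
    Assumption: evidence types are combined by an unweighted mean.
    """
    def pair_score(a, b):
        # Look the pair up in either order; a missing pair counts as 0.
        return mean(
            table.get((a, b), table.get((b, a), 0.0))
            for table in evidence.values()
        )

    scores = {
        c: mean(pair_score(c, n) for n in neighbor_genes) for c in candidates
    }
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


def top_k_fraction(ranked_lists, true_genes, k):
    """Fraction of functions whose true gene is among the top k candidates."""
    hits = sum(true in ranked[:k] for ranked, true in zip(ranked_lists, true_genes))
    return hits / len(true_genes)


# Toy data (entirely made up): three evidence types, three candidates.
evidence = {
    "chromosomal_clustering": {("geneX", "geneA"): 0.9, ("geneY", "geneA"): 0.1},
    "phylogenetic_profile": {("geneX", "geneB"): 0.8},
    "coexpression": {("geneZ", "geneB"): 0.3},
}
ranked = rank_candidates(["geneA", "geneB"], ["geneX", "geneY", "geneZ"], evidence)
print(ranked)
print(top_k_fraction([ranked], ["geneX"], k=1))
```

The second function mirrors the kind of summary reported in the Results (the share of functions whose known gene appears within the top k candidates); applied to real data it would be computed over held-out enzymes with known genes.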

[31]  David Sankoff,et al.  Rearrangements and chromosomal evolution. , 2003, Current opinion in genetics & development.

[32]  B. Snel,et al.  Systematic discovery of analogous enzymes in thiamin biosynthesis , 2003, Nature Biotechnology.

[33]  D. Eisenberg,et al.  Detecting protein function and protein-protein interactions from genome sequences. , 1999, Science.

[34]  P. Bork,et al.  Functional organization of the yeast proteome by systematic analysis of protein complexes , 2002, Nature.

[35]  T. Ito,et al.  Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[36]  David Sankoff,et al.  Tests for gene clustering , 2002, RECOMB '02.

[37]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[38]  Yudong D. He,et al.  Functional Discovery via a Compendium of Expression Profiles , 2000, Cell.

[39]  P. Bork,et al.  Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli , 1996, Current Biology.

[40]  R. Overbeek,et al.  Missing genes in metabolic pathways: a comparative genomics approach. , 2003, Current opinion in chemical biology.

[41]  Christian von Mering,et al.  STRING: a database of predicted functional associations between proteins , 2003, Nucleic Acids Res..

[42]  Robert E. Schapire,et al.  The Boosting Approach to Machine Learning An Overview , 2003 .

[43]  James R. Knight,et al.  A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae , 2000, Nature.

[44]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[45]  G. Church,et al.  A global view of pleiotropy and phenotypically derived gene function in yeast , 2005, Molecular systems biology.

[46]  G. Church,et al.  Expression dynamics of a cellular metabolic network , 2005, Molecular systems biology.

[47]  D. Eisenberg,et al.  Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[48]  G. Johansson,et al.  Interaction between phosphofructokinase and aldolase from Saccharomyces cerevisiae studied by aqueous two-phase partitioning. , 2001, Journal of chromatography. B, Biomedical sciences and applications.

[49]  Francis D. Gibbons,et al.  Predicting protein complex membership using probabilistic network reliability. , 2004, Genome research.

[50]  A. Owen,et al.  A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae) , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[51]  V. de Crécy-Lagard,et al.  Identification of the tRNA-Dihydrouridine Synthase Family* , 2002, The Journal of Biological Chemistry.

[52]  Tatsuya Akutsu,et al.  Clustering of database sequences for fast homology search using upper bounds on alignment score. , 2004, Genome informatics. International Conference on Genome Informatics.

[53]  Michael Y. Galperin,et al.  The COG database: new developments in phylogenetic classification of proteins from complete genomes , 2001, Nucleic Acids Res..

[54]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[55]  George M. Church,et al.  Filling gaps in a metabolic network using expression information , 2004, ISMB/ECCB.

[56]  G. Sumara,et al.  A Probabilistic Functional Network of Yeast Genes , 2004 .

[57]  S. Cordwell Microbial genomes and “missing” enzymes: redefining biochemical pathways , 1999, Archives of Microbiology.

[58]  S. L. Wong,et al.  Combining biological networks to predict genetic interactions. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[59]  Yoshihiro Yamanishi,et al.  Supervised enzyme network inference from the integration of genomic data and chemical information , 2005, ISMB.

[60]  B. Snel,et al.  Function prediction and protein networks. , 2003, Current opinion in cell biology.

[61]  B. Snel,et al.  Predicting gene function by conserved co-expression. , 2003, Trends in genetics : TIG.

[62]  Yoav Freund,et al.  Predicting genetic regulatory response using classification , 2004, ISMB/ECCB.

[63]  Sarah A Teichmann,et al.  Conservation of gene co-regulation in prokaryotes and eukaryotes. , 2002, Trends in biotechnology.

[64]  Timothy C. Meredith,et al.  Escherichia coli YrbH Is a D-Arabinose 5-Phosphate Isomerase* , 2003, Journal of Biological Chemistry.