Inferring functional modules of protein families with probabilistic topic models

Background: Genome and metagenome studies have identified thousands of protein families whose functions are poorly understood and for which techniques for functional characterization provide only partial information. For such proteins, the genome context can give further information about their functional context.

Results: We describe a Bayesian method, based on a probabilistic topic model, which directly identifies functional modules of protein families. The method explores the co-occurrence patterns of protein families across a collection of sequence samples to infer a probabilistic model of arbitrarily-sized functional modules.

Conclusions: We show that our method identifies protein modules, some of which correspond to well-known biological processes, that are tightly interconnected with known functional interactions and are different from the interactions identified by pairwise co-occurrence. The modules are not specific to any given organism and may combine different realizations of a protein complex or pathway within different taxa.
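To make the co-occurrence idea concrete, the following is a minimal sketch of a topic model fitted by collapsed Gibbs sampling, where each sequence sample plays the role of a document and each protein family occurrence plays the role of a word. It assumes an LDA-style model with a fixed number of modules; the abstract does not specify the exact model or inference scheme, so the function name `fit_modules`, the hyperparameters `alpha` and `beta`, and the toy family labels are all illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import defaultdict


def fit_modules(samples, n_modules=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for an LDA-style model (illustrative sketch).

    samples: list of lists of protein-family identifiers, one list per
    genome or metagenome sample.
    Returns (module_family_counts, sample_module_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({fam for sample in samples for fam in sample})
    n_families = len(vocab)

    # Count tables maintained by the sampler.
    sample_module = [[0] * n_modules for _ in samples]
    module_family = [defaultdict(int) for _ in range(n_modules)]
    module_total = [0] * n_modules

    # Randomly assign every family occurrence to a module.
    assignments = []
    for s, sample in enumerate(samples):
        row = []
        for fam in sample:
            k = rng.randrange(n_modules)
            row.append(k)
            sample_module[s][k] += 1
            module_family[k][fam] += 1
            module_total[k] += 1
        assignments.append(row)

    for _ in range(n_iter):
        for s, sample in enumerate(samples):
            for i, fam in enumerate(sample):
                # Remove the current assignment from the counts.
                k = assignments[s][i]
                sample_module[s][k] -= 1
                module_family[k][fam] -= 1
                module_total[k] -= 1

                # Unnormalized conditional probability of each module.
                weights = [
                    (sample_module[s][j] + alpha)
                    * (module_family[j][fam] + beta)
                    / (module_total[j] + beta * n_families)
                    for j in range(n_modules)
                ]
                k = rng.choices(range(n_modules), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[s][i] = k
                sample_module[s][k] += 1
                module_family[k][fam] += 1
                module_total[k] += 1

    return module_family, sample_module


# Toy input: four hypothetical samples listing made-up family labels.
toy_samples = [
    ["flgA", "flgB", "fliC"],
    ["flgA", "flgB", "fliC", "cheA"],
    ["nifH", "nifD", "nifK"],
    ["nifH", "nifD", "nifK", "cheA"],
]
module_family, sample_module = fit_modules(toy_samples, n_modules=2, n_iter=50, seed=1)
```

Each entry of `module_family` holds the number of occurrences of every family currently assigned to that module, so families that tend to appear in the same samples accumulate in the same module. A family may contribute to more than one module, which is one way such a model differs from a hard clustering of pairwise co-occurrence scores.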
