Quantitative assessment of protein function prediction from metagenomics shotgun sequences

To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are also considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions. We illustrate this with two examples: a previously undescribed gene associated with the well-characterized heme biosynthesis operon, and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on Earth, many functions remain to be discovered in numerous small, rare protein families.
