The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
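
The closing claim, that new families appear at a rate linear or almost linear in the number of added sequences, can be illustrated with a family accumulation curve. The following is a minimal sketch, not the authors' pipeline. It assumes each sequence has already been assigned a cluster label by some similarity clustering step. The function names `discovery_curve` and `is_roughly_linear`, the `tolerance` parameter, and the toy labels are all hypothetical and chosen only for illustration.

```python
from typing import Hashable, List, Sequence


def discovery_curve(family_labels: Sequence[Hashable]) -> List[int]:
    """Cumulative count of distinct families after each added sequence."""
    seen = set()
    curve = []
    for label in family_labels:
        seen.add(label)          # a repeated label adds no new family
        curve.append(len(seen))
    return curve


def is_roughly_linear(curve: Sequence[int], tolerance: float = 0.2) -> bool:
    """Crude check: the discovery rate in the second half of the curve
    is at least (1 - tolerance) times the rate in the first half.
    This heuristic is an illustrative assumption, not a statistical test."""
    if len(curve) < 2:
        raise ValueError("need at least two points to compare rates")
    half = len(curve) // 2
    first_rate = curve[half - 1] / half
    second_rate = (curve[-1] - curve[half - 1]) / (len(curve) - half)
    return second_rate >= (1 - tolerance) * first_rate


# Synthetic toy labels (not GOS data): new families keep appearing.
still_growing = ["a", "b", "a", "c", "d", "b", "e", "f"]
# Synthetic toy labels: every family is found early, then the curve plateaus.
saturating = ["a", "b", "c", "d", "a", "b", "c", "d"]

growing_curve = discovery_curve(still_growing)
saturating_curve = discovery_curve(saturating)
```

A curve that keeps rising at a steady rate, like `growing_curve`, corresponds to the situation the abstract describes. A plateau, like `saturating_curve`, would instead suggest that most families had already been sampled.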
