Patterns of protein‐fold usage in eight microbial genomes: A comprehensive structural census

Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage, that is, whether a given fold occurs in a particular organism. Of the ∼340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all‐helical structure and enriched in mixed helix‐sheet structure compared to the folds in the PDB. The top‐10 most common of the shared 30 are enriched in superfolds, uniting many non‐homologous sequence families, and are especially similar in overall architecture: eight have helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top‐10 most highly expressed folds are considerably different from the most highly duplicated folds. A tree can be constructed grouping genomes in terms of their shared folds. This has a remarkably similar topology to more conventional classifications, based on very different measures of relatedness. Finally, folds of membrane proteins can be analyzed through transmembrane‐helix (TM) prediction. All the genomes appear to have similar usage patterns for these folds, with the occurrence of a particular fold falling off rapidly with increasing numbers of TM‐elements, according to a “Zipf‐like” law. This implies there are no marked preferences for proteins with particular numbers of TM‐helices (e.g. 7‐TM) in microbial genomes. Further information pertinent to this analysis is available at http://bioinfo.mbb.yale.edu/genome. Proteins 33:518–534, 1998. © 1998 Wiley‐Liss, Inc.
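
The fold-usage comparison described above can be illustrated with a minimal sketch. It is not the authors' procedure: the genome labels, fold names, and counts are hypothetical toy data, the Jaccard distance is only one plausible choice of shared-fold measure for building a tree, and the log-log least-squares slope is an assumed way of checking a "Zipf-like" fall-off.

```python
import math
from itertools import combinations

# Hypothetical presence/absence data: folds found in each genome (toy values).
fold_usage = {
    "genome_A": {"TIM-barrel", "Rossmann", "ferredoxin-like", "P-loop"},
    "genome_B": {"TIM-barrel", "Rossmann", "P-loop"},
    "genome_C": {"TIM-barrel", "ferredoxin-like", "globin"},
}

def shared_folds(usage):
    """Return the folds that occur in every genome."""
    return set.intersection(*usage.values())

def fold_distance(a, b):
    """Jaccard distance between two fold sets (an assumed measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def distance_matrix(usage):
    """Pairwise distances, usable as input to any tree-building method."""
    return {
        (g1, g2): fold_distance(usage[g1], usage[g2])
        for g1, g2 in combinations(sorted(usage), 2)
    }

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    counts[i] is the number of proteins with i + 1 predicted TM-helices.
    A slope near -1 would indicate a Zipf-like fall-off.
    """
    xs = [math.log(i + 1) for i in range(len(counts))]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical counts of proteins with 1, 2, 3, 4 TM-helices (toy values).
tm_counts = [120, 60, 40, 30]

if __name__ == "__main__":
    print(sorted(shared_folds(fold_usage)))
    for pair, dist in distance_matrix(fold_usage).items():
        print(pair, round(dist, 2))
    print(round(loglog_slope(tm_counts), 3))
```

The distance matrix could be passed to a standard hierarchical clustering routine to group genomes. A smooth power-law fall-off with no bump at a particular helix count is what the text means by the absence of a marked preference for, say, 7‐TM proteins.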
