A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure.

Representative genomes from each of the three kingdoms of life are compared in terms of protein structure: those of Haemophilus influenzae (a bacterium), Methanococcus jannaschii (an archaeon), and yeast (a eukaryote). The comparison takes the form of a census (a comprehensive accounting) of the relative occurrence of secondary and tertiary structures in the genomes, with particular emphasis on patterns of super-secondary structure. Comparison of secondary structure shows that the three genomes have nearly the same overall secondary-structure content, although they differ markedly in amino acid composition. Comparison of super-secondary structure, using a novel "frequent-words" approach, shows that yeast has a preponderance of consecutive strands (e.g. beta-beta-beta patterns); Haemophilus, of consecutive helices (alpha-alpha-alpha); and Methanococcus, of alternating helix-strand structures (beta-alpha-beta). Yeast also has significantly more helical membrane proteins than the other two genomes, with most of the difference concentrated in proteins containing two transmembrane segments. Comparison of tertiary structure (by sequence matching and domain-level clustering) highlights the substantial duplication in each genome (approximately 30% to 50%), with the degree of duplication following similar patterns in all three. Many sequence families are shared among the genomes, and the degree of overlap between any two genomes is roughly similar. In total, the three genomes contain 148 of the approximately 300 known protein folds. The 45 of these 148 folds that are present in all three genomes are especially enriched in mixed (alpha/beta) super-secondary structures. Moreover, the five most common of these 45 (the "top-5") have a remarkably similar super-secondary structure architecture, containing a central sheet of parallel strands with helices packed onto at least one face and beta-alpha-beta connections between adjacent strands. These most basic molecular parts, which presumably were present in the last common ancestor of the three kingdoms, include the TIM-barrel, Rossmann, flavodoxin, thiamin-binding, and P-loop-hydrolase folds.
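
The abstract does not spell out the "frequent-words" procedure, so the following is only a minimal sketch of one plausible reading: each protein's predicted secondary structure is collapsed into a string of elements, and overlapping words of fixed length are counted across a genome. The per-residue codes (H, E, C), the element letters (a for helix, b for strand), the function names, and the toy input are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import groupby


def collapse_elements(ss_string):
    """Collapse a per-residue string (H=helix, E=strand, C=coil; assumed codes)
    into one letter per element: 'a' for a helix, 'b' for a strand.
    Coil runs are dropped, so 'EEECCHHHHCCEEE' becomes 'bab'."""
    mapping = {"H": "a", "E": "b"}
    return "".join(
        mapping[state] for state, _run in groupby(ss_string) if state in mapping
    )


def count_words(element_strings, k=3):
    """Count every overlapping word of k consecutive elements
    (e.g. 'bab', 'aaa') over all proteins in a genome."""
    counts = Counter()
    for elements in element_strings:
        for i in range(len(elements) - k + 1):
            counts[elements[i:i + k]] += 1
    return counts


# Hypothetical toy "genome": three predicted secondary-structure strings.
toy_predictions = ["EEECCHHHHCCEEE", "HHHCCHHHCCHHHH", "EEECEEECEEECHHH"]
toy_elements = [collapse_elements(s) for s in toy_predictions]
toy_counts = count_words(toy_elements, k=3)

for word, n in toy_counts.most_common():
    print(word, n)
```

Comparing such word counts between genomes, normalised by the total number of words in each, would give the kind of relative-occurrence figures the census describes; the choice of word length k = 3 here simply mirrors the three-element patterns named in the abstract.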
