Distribution of protein folds in the three superkingdoms of life.

A sensitive protein-fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program. A collection of 1193 position-dependent weight matrices that can be used as fold identifiers (FIDs) was produced. In the completely sequenced genomes, folds could be automatically identified for 20%-30% of the proteins, with 3%-6% more detectable by additional analysis of conserved motifs. The distribution of the most common folds is very similar in bacteria and archaea but distinct in eukaryotes. Within the bacteria, this distribution differs between parasitic and free-living species. In all analyzed genomes, the P-loop NTPases are the most abundant fold. In bacteria and archaea, the next most common folds are ferredoxin-like domains, TIM-barrels, and methyltransferases, whereas in eukaryotes, the second to fourth places belong to protein kinases, beta-propellers, and TIM-barrels. The observed diversity of protein folds in different proteomes is approximately twice as high as would be expected from a simple stochastic model that describes a proteome as a finite sample from an infinite pool of proteins with an exponential distribution of the fold fractions. The distribution of the number of domains with different folds in one protein fits the geometric model, which is compatible with the evolution of multidomain proteins by random combination of domains. [Fold predictions for proteins from 14 proteomes are available on the World Wide Web at. The FIDs are available by anonymous ftp at the same location.]
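The two models mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the pool is truncated to a finite number of folds, the fold fractions are taken to decay as exp(-decay * rank), the geometric model is parameterized as P(k) = (1 - q) * q^(k - 1) for k = 1, 2, ..., and all function names and input values are hypothetical.

```python
import math


def fold_fractions(num_folds, decay):
    """Assumed exponential fold fractions: p_i proportional to exp(-decay * i).

    The pool is truncated to num_folds folds and normalized to sum to 1.
    """
    weights = [math.exp(-decay * rank) for rank in range(num_folds)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct_folds(fractions, sample_size):
    """Expected number of distinct folds seen in a random sample of proteins.

    A fold with fraction p is missed by all sample_size draws with
    probability (1 - p) ** sample_size, so it is seen with the complement.
    """
    return sum(1.0 - (1.0 - p) ** sample_size for p in fractions)


def geometric_pmf(k, q):
    """Assumed geometric model: probability that a protein has k distinct folds."""
    return (1.0 - q) * q ** (k - 1)


def fit_geometric(counts):
    """Maximum-likelihood q from counts[k] = number of proteins with k folds.

    For this parameterization the mean is 1 / (1 - q), so q = 1 - 1 / mean.
    """
    total = sum(counts.values())
    mean = sum(k * c for k, c in counts.items()) / total
    return 1.0 - 1.0 / mean


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    fractions = fold_fractions(num_folds=1000, decay=0.02)
    for n in (500, 2000, 6000):
        print(n, round(expected_distinct_folds(fractions, n), 1))

    q = fit_geometric({1: 80, 2: 16, 3: 4})
    print([round(geometric_pmf(k, q), 3) for k in (1, 2, 3)])
```

Comparing `expected_distinct_folds` for a proteome-sized sample with the number of folds actually observed is the kind of check behind the "approximately twice as high" statement. The geometric fit corresponds to the picture in which each additional domain with a new fold is added with a fixed probability q.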
