Structural genomics analysis: Characteristics of atypical, common, and horizontally transferred folds

We conducted a structural genomics analysis of the folds and structural superfamilies in the first 20 completely sequenced genomes by focusing on the patterns of fold usage and trying to identify structural characteristics of typical and atypical folds. We assigned folds to sequences using PSI‐BLAST, run with a systematic protocol to reduce the amount of computational overhead. On average, folds could be assigned to about a fourth of the ORFs in the genomes and about a fifth of the amino acids in the proteomes. More than 80% of all the folds in the SCOP structural classification were identified in at least one of the 20 organisms, with worm and E. coli having the largest number of distinct folds. Folds are particularly effective at comprehensively measuring levels of gene duplication, because they group together even very remote homologues. Using folds, we find that the average level of duplication varies depending on the complexity of the organism, ranging from 2.4 in M. genitalium to 32 for the worm, values significantly higher than those observed based purely on sequence similarity. We rank the common folds in the 20 organisms, finding that the top three are the P‐loop NTP hydrolase, the ferredoxin fold, and the TIM‐barrel, and discuss in detail the many factors that affect and bias these rankings. We also identify atypical folds that are “unique” to one of the organisms in our study and compare the characteristics of these folds with the most common ones. We find that common folds tend to be more multifunctional and associated with more regular, “symmetrical” structures than the unique ones. In addition, many of the unique folds are associated with proteins involved in cell defense (e.g., toxins). We analyze specific patterns of fold occurrence in the genomes by associating some of them with instances of horizontal transfer and others with gene loss. In particular, we find three possible examples of transfer between archaea and bacteria and six between eukarya and bacteria.
We make available our detailed results at http://genecensus.org/20. Proteins 2002;47:126–141. © 2002 Wiley‐Liss, Inc.
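The fold census described above (ranking common folds, flagging folds unique to one organism, and measuring duplication) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the genome names and fold assignments are toy data, the function names are hypothetical, and the duplication level is taken here to be the number of fold-matched domains divided by the number of distinct folds, which may differ from the exact definition used in the study.

```python
from collections import Counter

# Hypothetical toy data: genome -> fold label of each matched domain.
# Real input would come from PSI-BLAST matches against SCOP domains.
assignments = {
    "genome_A": ["P-loop", "P-loop", "TIM-barrel", "ferredoxin", "P-loop"],
    "genome_B": ["P-loop", "ferredoxin", "ferredoxin", "toxin-like"],
    "genome_C": ["TIM-barrel", "P-loop"],
}


def duplication_level(folds):
    """Matched domains per distinct fold (an assumed definition)."""
    return len(folds) / len(set(folds)) if folds else 0.0


def rank_folds(assignments):
    """Rank folds by total occurrences over all genomes, ties broken by name."""
    total = Counter()
    for folds in assignments.values():
        total.update(folds)
    ordered = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [fold for fold, _ in ordered]


def unique_folds(assignments):
    """For each genome, list the folds that occur in no other genome."""
    presence = Counter()
    for folds in assignments.values():
        presence.update(set(folds))  # count each fold once per genome
    return {
        genome: sorted(f for f in set(folds) if presence[f] == 1)
        for genome, folds in assignments.items()
    }


if __name__ == "__main__":
    for genome, folds in assignments.items():
        print(genome, round(duplication_level(folds), 2))
    print(rank_folds(assignments))
    print(unique_folds(assignments))
```

Counting a fold once per genome when testing for uniqueness keeps a heavily duplicated fold from being mistaken for a widespread one; the ranking, by contrast, uses raw occurrence counts and so is sensitive to the biases the abstract mentions.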
