Protein family clustering for structural genomics.

A major goal of structural genomics is the provision of a structural template for a large fraction of protein domains. The magnitude of this task depends on the number and nature of protein sequence families. With a large number of bacterial genomes now fully sequenced, it is possible to obtain improved estimates of the number and diversity of families in that kingdom. We have used an automated clustering procedure to group all sequences in a set of genomes into protein families. Benchmarking shows that the clustering method is sensitive in detecting remote family members and has a low level of false positives. This comprehensive protein family set has been used to address the following questions. (1) What is the structure coverage for currently known families? (2) How will the number of known apparent families grow as more genomes are sequenced? (3) What is a practical strategy for maximizing structure coverage in the future? Our study indicates that approximately 20% of known families with three or more members currently have a representative structure. The study also indicates that the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes have been sequenced. However, the vast majority of these families will be small, and it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families.
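The strategy of sampling the larger families first can be illustrated with a minimal sketch. It assumes one representative structure covers every member of its family, and it uses invented toy family sizes rather than data from the study; the function names `coverage_curve` and `families_needed` are hypothetical and are not the authors' method.

```python
from typing import List, Optional


def coverage_curve(family_sizes: List[int]) -> List[float]:
    """Cumulative fraction of sequences covered when families are
    targeted in order of decreasing size (largest family first)."""
    total = sum(family_sizes)
    if total == 0:
        return []
    covered = 0
    curve = []
    for size in sorted(family_sizes, reverse=True):
        covered += size
        curve.append(covered / total)
    return curve


def families_needed(family_sizes: List[int], target: float) -> Optional[int]:
    """Smallest number of families (largest first) whose members make up
    at least `target` (a fraction in [0, 1]) of all sequences."""
    for n, fraction in enumerate(coverage_curve(family_sizes), start=1):
        if fraction >= target:
            return n
    return None  # target not reachable (e.g. target > 1 or no families)


# Toy, made-up distribution: a few large families and many small ones.
toy_sizes = [50, 20, 10, 5, 5, 3, 2, 1, 1, 1, 1, 1]
print(families_needed(toy_sizes, 0.75))  # 3 of the 12 toy families
```

In this toy distribution, three of the twelve families already account for 80% of the sequences, which is the qualitative effect the abstract describes: when most families are small, a modest number of structures from the large families covers most domains.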
