The CATH extended protein‐family database: Providing structural annotations for genome sequences

An automatic sequence search and analysis protocol, DomainFinder, has been developed for reliably integrating gene sequences from GenBank into their respective structural families within the CATH domain database (http://www.biochem.ucl.ac.uk/bsm/cath_new). The protocol is based on PSI‐BLAST and IMPALA and uses conservative thresholds. DomainFinder assigns a new gene sequence to a CATH homologous superfamily provided that PSI‐BLAST identifies a clear relationship to at least one Protein Data Bank sequence already within that superfamily. This has expanded the CATH protein family database (CATH‐PFDB v1.6) from 19,563 domain structures to 176,597 domain sequences. A further 50,000 putative homologous relationships can be identified using less stringent cut‐offs; these are maintained in neighbour tables in the CATH Oracle database, pending further evidence of the suggested evolutionary relationship. Analysis of the CATH‐PFDB shows that only 15% of the sequence families are close enough to a known structure for reliable homology modeling. IMPALA/PSI‐BLAST profiles have been generated for each sequence family in the expanded CATH‐PFDB, and a web server is provided so that new sequences can be scanned against the profile library and assigned to a structure and homologous superfamily.
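
The two-tier assignment rule described above (confident assignment under a conservative threshold, putative relationships kept separately under a less stringent one) can be sketched roughly as follows. This is only an illustration, not the DomainFinder implementation: the abstract does not state the cut-offs used, so the E-value thresholds, the `Hit` record, and the `classify_hits` function are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: the abstract does not give the actual values.
STRICT_EVALUE = 1e-4      # assumed "conservative" threshold
PERMISSIVE_EVALUE = 1e-2  # assumed "less stringent" threshold


@dataclass(frozen=True)
class Hit:
    """One search hit linking a gene sequence to a PDB sequence's superfamily."""
    query_id: str     # identifier of the gene sequence
    superfamily: str  # CATH superfamily of the matched PDB sequence
    evalue: float     # E-value reported by the search


def classify_hits(hits, strict=STRICT_EVALUE, permissive=PERMISSIVE_EVALUE):
    """Split hits into confident assignments and putative neighbours.

    Returns (assigned, neighbours):
      assigned   maps query_id -> superfamily of its best hit under `strict`
      neighbours lists (query_id, superfamily) pairs that pass only `permissive`
    """
    assigned = {}
    neighbours = []
    # Best hits first, so confident assignments are made before weaker
    # hits for the same query are considered.
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue <= strict:
            assigned.setdefault(hit.query_id, hit.superfamily)
        elif hit.evalue <= permissive:
            # Skip weak hits that merely repeat a confident assignment.
            if assigned.get(hit.query_id) != hit.superfamily:
                neighbours.append((hit.query_id, hit.superfamily))
    return assigned, neighbours


if __name__ == "__main__":
    example = [
        Hit("gene1", "3.40.50.720", 1e-30),
        Hit("gene2", "1.10.10.10", 5e-3),
    ]
    print(classify_hits(example))
```

Keeping the putative relationships in a separate list mirrors the role of the neighbour tables: they remain available for later confirmation without affecting the confident assignments.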