Assessing strategies for improved superfamily recognition

There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (∼13,000 nonredundant structures solved to date), several powerful sequence‐based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence‐based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single‐seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D‐HMM library, CATH‐ISL, increased the coverage to 86%. The single‐seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss‐Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
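The coverage figures quoted above (76% and 86%) are fractions of a validated benchmark of remote homologs that an HMM library recognizes. A minimal sketch of how such a figure could be computed follows. It is an illustration only, not the protocol used in the study: the function name `coverage`, the data layout, the E-value cutoff of 0.001, and the toy identifiers are all assumptions.

```python
from typing import Dict, List, Tuple


def coverage(
    benchmark: Dict[str, str],
    hits: Dict[str, List[Tuple[str, float]]],
    evalue_cutoff: float = 1e-3,  # assumed threshold, not taken from the paper
) -> float:
    """Fraction of benchmark domains assigned to their true superfamily.

    benchmark maps a domain ID to its validated superfamily.
    hits maps a domain ID to (superfamily, E-value) pairs reported by
    scanning that domain against an HMM library.
    """
    if not benchmark:
        raise ValueError("benchmark is empty")

    recognized = 0
    for domain, true_superfamily in benchmark.items():
        # A domain counts as recognized if any sufficiently significant
        # hit points at the validated superfamily.
        if any(
            superfamily == true_superfamily and evalue <= evalue_cutoff
            for superfamily, evalue in hits.get(domain, [])
        ):
            recognized += 1
    return recognized / len(benchmark)


# Toy data with hypothetical identifiers.
toy_benchmark = {"d1": "sfA", "d2": "sfA", "d3": "sfB", "d4": "sfC"}
toy_hits = {
    "d1": [("sfA", 1e-10)],
    "d2": [("sfB", 1e-5), ("sfA", 1e-4)],
    "d3": [("sfB", 0.5)],  # correct superfamily, but not significant
    "d4": [("sfC", 1e-3)],
}
toy_coverage = coverage(toy_benchmark, toy_hits)
```

On the toy data three of the four domains are recognized. Comparing two libraries on the same benchmark would amount to calling the function once per library with that library's hits.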
