Detection of homologous proteins by an intermediate sequence search

We developed a variant of the intermediate sequence search method (ISSnew) for detection and alignment of weakly similar pairs of protein sequences. ISSnew relates two query sequences through an intermediate sequence that is potentially homologous to both queries. The improvement over earlier intermediate sequence search methods was achieved by a more robust overlap score for a match between the queries through an intermediate. The approach was benchmarked on a data set of 2369 sequences of known structure with insignificant sequence similarity to each other (BLAST E‐value larger than 0.001); 2050 of these sequences had a related structure in the set. ISSnew performed significantly better than both PSI‐BLAST and a previously described intermediate sequence search method. PSI‐BLAST could not detect correct homologs for 1619 of the 2369 sequences. In contrast, ISSnew assigned a correct homolog as the top hit for 121 of these 1619 sequences, while incorrectly assigning homologs for only nine; it did not assign homologs for the remaining sequences. We estimate that ISSnew may be able to assign the folds of domains in ∼29,000 of the ∼500,000 sequences unassigned by PSI‐BLAST, with 90% specificity (1 − false‐positive fraction). In addition, we show that the 15 alignments with the most significant BLAST E‐values include nearly the best alignments constructed by ISSnew.
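The idea of linking two queries through a shared intermediate can be illustrated with a minimal sketch. The abstract does not define the overlap score used by ISSnew, so the code below substitutes a plain interval-overlap length on the intermediate sequence as a stand-in. The `Hit` record, the function names, and the `min_overlap` threshold of 30 residues are all hypothetical choices made for illustration, not the published method.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """A pairwise match between a query and an intermediate (hypothetical record)."""
    query: str
    intermediate: str
    start: int  # start of the matched region on the intermediate (0-based, inclusive)
    end: int    # end of the matched region on the intermediate (exclusive)


def overlap_length(a: Hit, b: Hit) -> int:
    """Number of intermediate residues covered by both matches (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def link_through_intermediates(
    hits_a: Iterable[Hit],
    hits_b: Iterable[Hit],
    min_overlap: int = 30,  # assumed threshold, not taken from the paper
) -> Optional[Tuple[int, str]]:
    """Return (overlap, intermediate id) for the best shared intermediate, or None."""
    # Index the second query's hits by intermediate for quick lookup.
    by_intermediate = {}
    for hb in hits_b:
        by_intermediate.setdefault(hb.intermediate, []).append(hb)

    best = None
    for ha in hits_a:
        for hb in by_intermediate.get(ha.intermediate, []):
            ov = overlap_length(ha, hb)
            # Keep only links whose matched regions overlap sufficiently.
            if ov >= min_overlap and (best is None or ov > best[0]):
                best = (ov, ha.intermediate)
    return best


hits_a = [Hit("A", "I1", 10, 100), Hit("A", "I2", 0, 40)]
hits_b = [Hit("B", "I1", 60, 150), Hit("B", "I2", 35, 90)]
result = link_through_intermediates(hits_a, hits_b)
```

In this toy input both queries match the intermediates `I1` and `I2`, but only on `I1` do the two matched regions share enough residues to pass the assumed threshold. Requiring the two matches to cover a common region of the intermediate is what keeps a multidomain intermediate from linking queries that each match a different domain.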
