Sequence clustering strategies improve remote homology recognitions while reducing search times.

Sequence databases are rapidly growing, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems, and at the same time speed up the searches for homologous proteins and improve the ability of profile methods to recognize distant homologies. One of our strategies combines database clustering, which removes highly redundant sequence, and a two-step PSI-BLAST (PDB-BLAST), which separates sequence spaces of profile composition and space of homology searching. The combination of these strategies improves distant homology recognitions by more than 100%, while using only 10% of the CPU time of the standard PSI-BLAST search. Another method, intermediate profile searches, allows for the exploration of additional search directions that are normally dominated by large protein sub-families within very diverse families. All methods are evaluated with a large fold-recognition benchmark.

[1]  T. Blundell,et al.  Comparative protein modelling by satisfaction of spatial restraints. , 1993, Journal of molecular biology.

[2]  Gapped BLAST and PSI-BLAST: A new , 1997 .

[3]  C. Chothia,et al.  Intermediate sequences increase the detection of homology between sequences. , 1997, Journal of molecular biology.

[4]  J. Thompson,et al.  Multiple sequence alignment with Clustal X. , 1998, Trends in biochemical sciences.

[5]  P E Bourne,et al.  Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. , 1998, Protein engineering.

[6]  Chris Sander,et al.  Removing near-neighbour redundancy from large protein sequence collections , 1998, Bioinform..

[7]  D. Haussler,et al.  Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. , 1998, Journal of molecular biology.

[8]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[9]  C A Orengo,et al.  Combining sensitive database searches with multiple intermediates to detect distant homologues. , 1999, Protein engineering.

[10]  Liisa Holm,et al.  RSDB: representative protein sequence databases have high information content , 2000, Bioinform..

[11]  Sarah A. Teichmann,et al.  Fast assignment of protein structures to sequences using the Intermediate Sequence Library PDB-ISL , 2000, Bioinform..

[12]  김삼묘,et al.  “Bioinformatics” 특집을 내면서 , 2000 .

[13]  Patrice Koehl,et al.  The ASTRAL compendium for protein structure and sequence analysis , 2000, Nucleic Acids Res..

[14]  A. Godzik,et al.  Comparison of sequence profiles. Strategies for structural predictions using sequence information , 2008, Protein science : a publication of the Protein Society.

[15]  Adam Godzik,et al.  Saturated BLAST: an automated multiple intermediate sequence search used to detect distant homology , 2000, Bioinform..

[16]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[17]  Adam Godzik,et al.  Tolerating some redundancy significantly speeds up clustering of large protein databases , 2002, Bioinform..