Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment

MOTIVATION: Genome sequencing projects require the periodic application of analysis tools that can classify and multiply align related protein sequence domains. Full automation of this task requires an efficient integration of similarity and alignment techniques.

RESULTS: We have developed a fully automated process that classifies entire protein sequence databases and produces alignments of the homologous sequences. The successive steps of the procedure are based on compositional and local sequence similarity searches followed by multiple sequence alignments. Global similarities are detected from the pairwise comparison of the amino acid and dipeptide compositions of each protein. After the elimination of all but one sequence from each detected cluster of closely related proteins, the remaining sequences are compiled in a suffix tree, which is self-compared to detect local sequence similarities. Sets of proteins that share similar sequence segments are then weighted according to their closeness and multiply aligned using a fast hierarchical dynamic programming algorithm. Computational strategies were devised to minimize computer processing time and memory space requirements. The accuracy of the sequence classifications has been evaluated for 12,462 primary structures distributed over 341 known families. The percentage of sequences with missed or incorrect family assignments was 6.8% on the test set. This error level is only twice that of the manually constructed PROSITE database (3.4%) and is substantially lower than that found for the automatically built PRODOM database (34.9%).

AVAILABILITY: The resulting database, called DOMO, is available through the SRS database search routine at the Infobiogen (http://www.infobiogen.fr/srs5/), EBI (http://srs.ebi.ac.uk:5000/) and EMBL (http://www.embl-heidelberg.de/srs5/) World Wide Web sites.

CONTACT: gracy@infobiogen.fr
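The compositional step described above (pairwise comparison of amino acid and dipeptide compositions, then keeping one sequence per cluster of close relatives) can be illustrated with a minimal sketch. The abstract does not state the distance measure, the threshold, or the clustering rule, so the Euclidean distance, the `threshold` value, and the greedy selection below are assumptions chosen for illustration, and the function names are hypothetical rather than part of DOMO.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def composition(seq):
    """Return a 420-element vector: 20 amino acid and 400 dipeptide frequencies."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = max(len(seq), 1)      # guard against empty sequences
    n_di = max(len(seq) - 1, 1)
    return ([mono[a] / n_mono for a in AMINO_ACIDS]
            + [di[d] / n_di for d in DIPEPTIDES])


def distance(u, v):
    """Euclidean distance between two composition vectors (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_representatives(seqs, threshold=0.05):
    """Greedily keep a sequence only if it is farther than `threshold`
    from every sequence already kept (assumed clustering rule)."""
    kept = []
    for name, seq in seqs.items():
        vec = composition(seq)
        if all(distance(vec, kept_vec) > threshold for _, kept_vec in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]


if __name__ == "__main__":
    toy = {
        "a": "MKVLAAGIVGLLLA",
        "b": "MKVLAAGIVGLLLA",   # identical to "a", so it is dropped
        "c": "WWWWPPPPHHHHCC",   # very different composition, so it is kept
    }
    print(pick_representatives(toy))
```

Because each protein is reduced to a fixed-length vector, this comparison costs the same regardless of sequence length, which is one plausible reason to run it before the more expensive local similarity search.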
