Systematic and Fully Automated Identification of Protein Sequence Patterns

We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families that are defined by patterns and contain DR records). Compared with the corresponding PROSITE patterns, Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Our results also show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate the method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and to the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
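To make the sensitivity and specificity comparison concrete, the following minimal Python sketch scores a PROSITE-style pattern against a family and a background set. It is an illustration only and not the Splash implementation: the helper names `prosite_to_regex` and `evaluate`, the toy sequences, and the example pattern are all assumptions, and the definitions used (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)) are the standard ones, which may differ from those used in the paper.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax only: '-' separators, 'x' wildcards,
    [..] allowed sets, {..} forbidden sets, (n) / (n,m) repeats,
    and the '<' / '>' terminal anchors.
    """
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        # Forbidden sets first, so their braces are not confused with repeats.
        elem = elem.replace("{", "[^").replace("}", "]")
        elem = elem.replace("(", "{").replace(")", "}")
        elem = elem.replace("x", ".")  # residues are upper case, so 'x' is safe
        elem = elem.replace("<", "^").replace(">", "$")
        parts.append(elem)
    return "".join(parts)

def evaluate(pattern: str, family: list[str], background: list[str]) -> tuple[float, float]:
    """Return (sensitivity, specificity) of a pattern on two sequence sets."""
    rx = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in family if rx.search(seq))      # family members hit
    fn = len(family) - tp                                # family members missed
    fp = sum(1 for seq in background if rx.search(seq))  # non-members hit
    tn = len(background) - fp                            # non-members rejected
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical toy data, not taken from PROSITE or from the paper.
PATTERN = "[LIVM]-[ST]-A-[STAG]-H-C"
FAMILY = ["AAVTAAHCGG", "MLSAGHCKK", "GGGGGGGG"]
BACKGROUND = ["KKKKKKKK", "ITASHCPP", "DDDDDDDD", "EEEEEEEE"]

sensitivity, specificity = evaluate(PATTERN, FAMILY, BACKGROUND)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Under these assumptions, one pattern would count as better than another for a family when it raises one of the two values without lowering the other, which is the sense of the comparison categories reported above.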
