Motif discovery without alignment or enumeration (extended abstract)

In this paper, we outline a novel combinatorial algorithm for the discovery of rigid motifs contained in a set of input sequences. This is achieved without pairwise alignment of the input sequences or enumeration of the entire motif space (solution space). Additionally, the reported motifs are guaranteed to be maximal in both length and composition. Internal repeats and patterns that repeat across sequences are treated uniformly by the algorithm. Results on real datasets are briefly discussed.

During the last twenty years, a number of algorithms have been devised for identifying sequence similarity in amino or nucleic acid sequences. String alignment [10] has been the underlying approach of choice for a large number of the resulting methods [3, 7, 4, 1, 2, 11, 12, 13, 16, 28], which attempted to determine a minimum-cost consensus in the presence of allowable editing transformations. The NP-hardness of the optimal alignment problem [15], compounded by the fact that alignment-based algorithms could typically reveal only global similarities [18, 24], gave rise to a different approach, that of pattern or motif discovery [17, 29, 24, 22, 21, 19, 18, 30, 25, 26, 27]. The implicit assumption underlying this new school of thought was that a frequently appearing motif ought to contribute to the functional behavior of the sequences that contain it and/or speak of the sequences' evolutionary relationship. For a comprehensive survey of several of these algorithms the reader can refer to [20].
RECOMB 98, New York, NY, USA. Copyright 1998 ACM 0-89791-976-9/98/3 $5.00

Typically, these new algorithms were able to discover pattern types that span the spectrum from simple strings to general regular expressions. The majority of these algorithms carried out an enumeration of the space in which the sought motifs resided, in a hypothesize-and-verify manner.

Pattern-discovery methods are not free of problems either. For one, pattern discovery in its general form can be shown to be an NP-hard task [14]. This quickly led to the incorporation of heuristics in the proposed techniques [21, 22, 24, 19]; the heuristics improved performance but at the cost of diminished flexibility. Also, many times the algorithms generated and reported patterns which were either not as specific as possible, or the result of information-theoretic pruning.

We next present TEIRESIAS, a novel algorithm for the discovery of motifs. Section 2 contains a formal definition of the problem together with a description of the algorithm. In Section 3 we briefly discuss results on actual biological data. Finally, conclusions and a discussion appear in Section 4.
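To make the hypothesize-and-verify style of enumeration mentioned above concrete, the following minimal sketch generates every candidate rigid pattern of a fixed length and checks each one against the input. It illustrates the enumeration approach that earlier methods relied on, not TEIRESIAS itself. The `.` wildcard notation, the function names, and the definition of support as the total number of occurrences are assumptions made for illustration only.

```python
from itertools import product

WILDCARD = "."  # assumed notation for a position that matches any character


def matches_at(pattern, seq, start):
    """Return True if the rigid pattern matches seq beginning at offset start."""
    return all(p == WILDCARD or p == seq[start + i] for i, p in enumerate(pattern))


def offsets(pattern, sequences):
    """List every (sequence index, offset) pair at which the pattern occurs.

    Repeats inside one sequence and repeats across sequences are collected
    in the same way.
    """
    return [
        (s, i)
        for s, seq in enumerate(sequences)
        for i in range(len(seq) - len(pattern) + 1)
        if matches_at(pattern, seq, i)
    ]


def enumerate_and_verify(sequences, length, min_support):
    """Hypothesize every pattern of the given length, then verify its support."""
    alphabet = sorted(set("".join(sequences)))
    found = {}
    # The candidate space has (len(alphabet) + 1) ** length members.
    for candidate in product(alphabet + [WILDCARD], repeat=length):
        # Only consider patterns that begin and end with a literal character.
        if candidate[0] == WILDCARD or candidate[-1] == WILDCARD:
            continue
        pattern = "".join(candidate)
        occurrences = offsets(pattern, sequences)
        if len(occurrences) >= min_support:
            found[pattern] = occurrences
    return found


if __name__ == "__main__":
    print(enumerate_and_verify(["ACGT", "ACTT"], length=3, min_support=2))
```

Because the number of candidates grows exponentially with the pattern length, this brute-force search quickly becomes impractical, which is the cost that an approach avoiding enumeration of the motif space sets out to remove.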

[1] Hamilton O. Smith, et al. Finding sequence motifs in groups of functionally related proteins. Proceedings of the National Academy of Sciences of the United States of America, 1990.

[2] M. Waterman, et al. Pattern recognition in several sequences: consensus and alignment. Bulletin of Mathematical Biology, 1984.

[3] Kaizhong Zhang, et al. Combinatorial pattern discovery for scientific data: some preliminary results. SIGMOD '94, 1994.

[4] Rolf Apweiler, et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Research, 1997.

[5] Alain Viari, et al. A Double Combinatorial Approach to Discovering Patterns in Biological Sequences. CPM, 1996.

[6] M. S. Waterman, et al. Identification of common molecular subsequences. Journal of Molecular Biology, 1981.

[7] D. Shasha, et al. Discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Research, 1994.

[8] A. F. Neuwald, et al. Detecting patterns in protein sequences. Journal of Molecular Biology, 1994.

[9] D. Higgins, et al. Finding flexible patterns in unaligned protein sequences. Protein Science, 1995.

[10] E. Uberbacher, et al. A fast look-up algorithm for detecting repetitive DNA sequences. 1996.

[11] H. M. Martinez. A flexible multiple sequence alignment program. Nucleic Acids Research, 1988.

[12] Douglas L. Brutlag, et al. Enumerating and Ranking Discrete Motifs. ISMB, 1997.

[13] A. Bairoch, et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Research, 1991.

[14] Douglas L. Brutlag, et al. Identification of Protein Motifs Using Conserved Amino Acid Properties and Partitioning Techniques. ISMB, 1995.

[15] M. Suyama, et al. Searching for common sequence patterns among distantly related proteins. Protein Engineering, 1995.

[16] Amos Bairoch, et al. The PROSITE database, its status in 1995. Nucleic Acids Research, 1996.

[17] A. Delcoigne, et al. Sequence comparison by dynamic programming. 1975.

[18] David R. Gilbert, et al. Approaches to the Automatic Discovery of Patterns in Biosequences. J. Comput. Biol., 1998.

[19] David Maier, et al. The Complexity of Some Problems on Subsequences and Supersequences. JACM, 1978.

[20] Alain Viari, et al. Multiple Sequence Comparison: A Peptide Matching Approach. CPM, 1995.

[21] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 1970.

[22] D. Lipman, et al. The multiple sequence alignment problem in biology. 1988.

[23] H. M. Martinez, et al. A multiple sequence alignment program. Nucleic Acids Research, 1986.

[24] Amos Bairoch, et al. The PROSITE database, its status in 1997. Nucleic Acids Research, 1997.

[25] Christus, et al. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. 2022.

[26] R. F. Smith, et al. Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 1990.

[27] H. M. Martinez, et al. An efficient method for finding repeats in molecular sequences. Nucleic Acids Research, 1983.

[28] M. Waterman, et al. A method for fast database search for all k-nucleotide repeats. Nucleic Acids Research, 1994.

[29] G. Arents, et al. Topography of the histone octamer surface: repeating structural motifs utilized in the docking of nucleosomal DNA. Proceedings of the National Academy of Sciences of the United States of America, 1993.