Voting algorithms for discovering long motifs

Pevzner and Sze [14] introduced the Planted (l,d)-Motif Problem: find similar patterns (motifs) in a set of sequences that represent the promoter regions of co-regulated genes. Here l is the length of the motif and d is the maximum Hamming distance allowed between the motif and each of its occurrences. Many algorithms have been developed to solve this problem. However, they either take too long to run or do not guarantee that the motif will be found. In this paper, we introduce new algorithms for the motif problem. Our algorithms find the motif in reasonable time not only for the challenging (9,2), (11,3), and (15,5) motif problems but also for longer motifs, such as (20,7), (30,11), and (40,15), which have never been seriously attempted by other researchers because of their high time and space complexities.
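To make the problem statement concrete, the following is a minimal sketch of what it means for a candidate string to be an (l,d)-motif of a set of sequences. It illustrates only the problem definition, not the algorithms introduced in this paper. The function names (`hamming`, `occurs_within`, `is_planted_motif`, `brute_force_motifs`) and the toy sequences are hypothetical, and the sketch assumes every sequence must contain at least one occurrence within distance d.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some l-mer of `sequence` is within Hamming distance d of `candidate`."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_planted_motif(candidate, sequences, d):
    """True if every sequence contains an occurrence of `candidate` within distance d."""
    return all(occurs_within(candidate, s, d) for s in sequences)


def brute_force_motifs(sequences, l, d, alphabet="ACGT"):
    """Exhaustive baseline (assumption: only feasible for very small l).

    Tries all len(alphabet)**l candidate strings, so it is meant purely
    to illustrate the definition, not as a practical method.
    """
    return [
        "".join(p)
        for p in product(alphabet, repeat=l)
        if is_planted_motif("".join(p), sequences, d)
    ]


# Toy input (hypothetical): "ACGT" occurs exactly in the first sequence
# and with one mismatch in each of the other two.
toy_sequences = ["TTACGTAA", "GGACCTCC", "ACGAGGGG"]
found = brute_force_motifs(toy_sequences, l=4, d=1)
```

The exhaustive search enumerates every string of length l over the alphabet, so its cost grows exponentially in l. This is one way to see why instances such as (20,7) or (40,15) are out of reach for naive enumeration.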

[1]  Grit Herrmann, et al.  Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm, 1996, Comput. Appl. Biosci.

[2]  Jeremy Buhler, et al.  Finding motifs using random projections, 2001, RECOMB.

[3]  Pavel A. Pevzner, et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences, 2000, ISMB.

[4]  Rodger Staden, et al.  Methods for discovering novel motifs in nucleic acid sequences, 1989, Comput. Appl. Biosci.

[5]  J. Collado-Vides, et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, 1998, Journal of molecular biology.

[6]  G. Stormo, et al.  Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation stati, 1995.

[7]  Bin Ma, et al.  Finding similar regions in many strings, 1999, STOC '99.

[8]  Shoudan Liang, et al.  cWINNOWER algorithm for finding fuzzy DNA motifs, 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[9]  Marie-France Sagot, et al.  Spelling Approximate Repeated or Common Motifs Using a Suffix Tree, 1998, LATIN.

[10]  A. A. Reilly, et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, 1990, Proteins.

[11]  Michael Q. Zhang, et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae, 1999, Bioinform.

[12]  E. Koonin, et al.  Prediction of transcription regulatory sites in Archaea by a comparative genomic approach, 2000, Nucleic acids research.

[13]  Jun S. Liu, et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, 1993, Science.

[14]  David R. Gilbert, et al.  Approaches to the Automatic Discovery of Patterns in Biosequences, 1998, J. Comput. Biol.

[15]  C. Elkan, et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization, 1995, Machine Learning.

[16]  Hanah Margalit, et al.  Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon, 1995, Comput. Appl. Biosci.

[17]  Norishige Chiba, et al.  Arboricity and Subgraph Listing Algorithms, 1985, SIAM J. Comput.

[18]  G. Pesole, et al.  WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences, 1992, Nucleic acids research.

[19]  Martin Tompa, et al.  An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem, 1999, ISMB.