Fast Motif Selection for Biological Sequences

We consider the problem of identifying motifs, that is, recurring or conserved patterns, in sets of biological sequences. To solve this task, we present new deterministic and exact algorithms for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. The proposed algorithms (1) improve search efficiency compared to existing exact algorithms by focusing the search on a selected set of potential motif instances, and (2) scale well with the input length and the alphabet size. Our algorithms are orders of magnitude faster than existing exact algorithms for common pattern identification. We evaluate our algorithms on benchmark motif finding problems and real applications in biological sequence analysis, and show that they exhibit significant running time improvements compared to state-of-the-art approaches.
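To make the problem statement concrete, the following is a minimal sketch of a naive exact baseline, not the algorithms proposed here. It assumes that inexact instances are measured by Hamming distance, that a motif has a fixed length `l` and at most `d` mismatches per instance, and that "most of the input strings" is expressed as a quorum count. The names `find_motifs`, `neighbors`, `occurs`, and `quorum` are hypothetical and chosen only for illustration.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(kmer, d, alphabet):
    """All strings within Hamming distance d of kmer (assumed mismatch model)."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(kmer)), k):
            for letters in product(alphabet, repeat=k):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def occurs(pattern, seq, d):
    """True if some window of seq is within d mismatches of pattern."""
    l = len(pattern)
    return any(
        hamming(pattern, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def find_motifs(seqs, l, d, quorum=None, alphabet="ACGT"):
    """Exhaustive baseline: patterns of length l that occur, with at most
    d mismatches, in at least `quorum` of the input sequences."""
    if quorum is None:
        quorum = len(seqs)  # default: the pattern must occur in every sequence
    # Any valid motif lies within distance d of some window of the input,
    # so candidates are drawn only from those neighborhoods.
    candidates = set()
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            candidates |= neighbors(seq[i:i + l], d, alphabet)
    return sorted(
        p for p in candidates
        if sum(occurs(p, s, d) for s in seqs) >= quorum
    )


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGTTT", "GGACGAGG"]
    print(find_motifs(seqs, l=4, d=0, quorum=2))
```

The cost of this baseline grows quickly with `l`, `d`, and the alphabet size, because each window contributes a full mismatch neighborhood to the candidate set. That growth is the kind of cost an exact method has to control in order to scale with input length and alphabet size.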
