Sublinear Selection Algorithms for Motif Finding

We consider the problem of identifying motifs, that is, recurring or conserved patterns, in sets of biological sequences. To solve this task, we present new deterministic, exact algorithms for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. The proposed algorithms (1) improve search efficiency over existing exact algorithms by focusing the search on a selected set of potential motif instances, and (2) scale well with the input length and the alphabet size. While a variety of exact and probabilistic methods exist, our algorithms enhance the pattern detection ability of these methods in two ways: (1) as a wrapper speed-up mechanism for a variety of common exact enumeration-based pattern finders, which allows them to search for longer, less conserved motifs, and (2) as candidate selectors combined with probabilistic pattern finders, which accelerates the search for pattern models. Our algorithms are orders of magnitude faster than existing exact algorithms for common pattern identification. We evaluate them on benchmark motif finding problems and on real applications in biological sequence analysis, and show significant running time improvements over state-of-the-art approaches.
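To make the problem statement concrete, the following is a minimal sketch of a baseline exact enumeration search for motifs of length l with at most d mismatches. It is not the selection algorithm proposed here; it only illustrates the kind of enumeration-based pattern finder that such a method would wrap. The function names (`hamming`, `neighbors`, `occurs`, `find_motifs`) and the `min_support` parameter are hypothetical, and Hamming distance is assumed as the measure of inexact matching.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(kmer, d, alphabet):
    """All strings within Hamming distance d of kmer."""
    result = {kmer}
    for r in range(1, d + 1):
        for positions in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def occurs(pattern, seq, d):
    """True if seq contains a substring within distance d of pattern."""
    l = len(pattern)
    return any(
        hamming(pattern, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def find_motifs(seqs, l, d, alphabet="ACGT", min_support=None):
    """Exact search for length-l patterns occurring, with at most d
    mismatches, in at least min_support sequences (default: all)."""
    n = len(seqs)
    q = n if min_support is None else min_support
    # A pattern present in at least q of n sequences must occur in one of
    # the first n - q + 1, so candidates are drawn from those only.
    candidates = set()
    for seq in seqs[: n - q + 1]:
        for i in range(len(seq) - l + 1):
            candidates |= neighbors(seq[i:i + l], d, alphabet)
    return sorted(
        p for p in candidates
        if sum(occurs(p, s, d) for s in seqs) >= q
    )
```

The candidate set here grows quickly with l, d, and the alphabet size, which is the cost that restricting the search to a selected set of potential motif instances is meant to reduce.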
