discovery of common patterns in sequences over large alphabets

We consider the problem of identifying motifs, recurring or conserved patterns, in the data modeled as strings or sequences. In particular, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. Our algorithm is several orders of magnitude faster than existing deterministic algorithms for common pattern identification. We evaluate our algorithm on benchmark motif finding problems and real applications in biological sequence analysis and show that our algorithm maintains predictive performance with significant running time improvements.

[1]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[2]  Francis Y. L. Chin,et al.  Voting algorithms for discovering long motifs , 2005, APBC.

[3]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[4]  Christina S. Leslie,et al.  Fast String Kernels using Inexact Matching for Protein Sequences , 2004, J. Mach. Learn. Res..

[5]  Marie-France Sagot,et al.  RISOTTO: Fast Extraction of Motifs with Mismatches , 2006, LATIN.

[6]  Eleazar Eskin,et al.  Finding composite regulatory patterns in DNA sequences , 2002, ISMB.

[7]  Sanguthevar Rajasekaran,et al.  Exact Algorithms for Planted Motif Problems , 2005, J. Comput. Biol..

[8]  Sanguthevar Rajasekaran,et al.  Space and Time Efficient Algorithms for Planted Motif Search , 2006, International Conference on Computational Science.

[9]  Andrew D. Smith,et al.  Toward Optimal Motif Enumeration , 2003, WADS.

[10]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[11]  Jaime I. Dávila,et al.  Fast and Practical Algorithms for Planted (l, d) Motif Search , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[12]  Michael I. Jordan,et al.  A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences , 2002, NIPS.

[13]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[14]  Brendan J. Frey,et al.  Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails , 2005, NIPS 2005.

[15]  T. Sejnowski,et al.  Discovering Spike Patterns in Neuronal Responses , 2004, The Journal of Neuroscience.

[16]  G. Stormo,et al.  Identifying protein-binding sites from unaligned DNA fragments. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Vladimir Pavlovic,et al.  Scalable Algorithms for String Kernels with Inexact Matching , 2008, NIPS.

[18]  Marie-France Sagot,et al.  Spelling Approximate Repeated or Common Motifs Using a Suffix Tree , 1998, LATIN.