Efficient motif finding algorithms for large-alphabet inputs

Background: We consider the problem of identifying motifs, recurring or conserved patterns, in biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.

Results: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of the alphabet. On a synthetic planted DNA motif finding problem, our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. The improvements are orders of magnitude larger in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA), we observed a reduction in running time of over 12×, with high detection accuracy. The algorithm also rapidly identified protein motifs in the Lipocalin and Zinc metallopeptidase families, and supersecondary structure motifs for the Cadherin and Immunoglobulin families.

Conclusions: Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially in the important and difficult case of large-alphabet sequences.
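To make the problem statement concrete, the following is a minimal brute-force sketch of the planted motif task: it reports every length-l pattern that occurs with at most d mismatches in at least a required number of input strings. It is not the algorithm proposed here; the function names, the `min_support` parameter, and the toy sequences are illustrative assumptions. Because it enumerates all |alphabet|^l candidate patterns, it also shows why the cost of naive search grows quickly with alphabet size.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(pattern, sequence, d):
    """True if some substring of `sequence` is within d mismatches of `pattern`."""
    l = len(pattern)
    return any(
        hamming(pattern, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def find_motifs(sequences, l, d, alphabet="ACGT", min_support=None):
    """Brute-force search (illustrative only, not the paper's method).

    Returns all length-l patterns over `alphabet` that occur, with at most
    d mismatches, in at least `min_support` of the input sequences.
    `min_support` is a hypothetical parameter; by default every sequence
    must contain an instance.
    """
    if min_support is None:
        min_support = len(sequences)
    motifs = []
    # Enumerates |alphabet|**l candidates, so this is only feasible for tiny inputs.
    for candidate in product(alphabet, repeat=l):
        pattern = "".join(candidate)
        support = sum(occurs_within(pattern, s, d) for s in sequences)
        if support >= min_support:
            motifs.append(pattern)
    return motifs


if __name__ == "__main__":
    # Toy data: "ACGT" appears exactly in the first string and with one
    # mismatch in the other two.
    toy = ["TTACGTAA", "GGACCTCC", "ACGAGGTT"]
    print("ACGT" in find_motifs(toy, l=4, d=1))
```

Setting `min_support` below the number of sequences corresponds to the "all or most of the input strings" case described in the Background.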
