An Efficient Exact Algorithm for the Motif Stem Search Problem over Large Alphabets

In recent years, there has been increasing interest in planted (l, d) motif search (PMS), which has applications in discovering significant segments in biological sequences. However, PMS over large alphabets has received little attention. This paper focuses on motif stem search (MSS), which was recently introduced to search for motifs in large-alphabet inputs. A motif stem is a string of length l that contains some wildcards. The goal of the MSS problem is to find a set of stems that represents a superset of all (l, d) motifs present in the input sequences, with this superset being as small as possible. The three main contributions of this paper are as follows: (1) We define the motif stem representation more precisely by using regular expressions. (2) We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with previous MSS algorithms, StemFinder runs much faster and reports fewer stems, which represent a smaller superset of all (l, d) motifs. StemFinder is freely available at http://sites.google.com/site/feqond/stemfinder.
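
To make the terms above concrete, the following is a minimal Python sketch of the underlying definitions only; it is not the StemFinder algorithm. It assumes the standard PMS definition, under which an (l, d) motif is a string of length l that occurs in every input sequence with at most d mismatches, and it assumes that a wildcard stands for any single symbol of the alphabet. The wildcard character `*`, the function names, and the toy sequences are illustrative choices that do not come from the paper.

```python
import re
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_motif(candidate, sequences, d):
    """Brute-force check of the (l, d) motif condition (assumed definition):
    every sequence must contain a substring of length l = len(candidate)
    within Hamming distance d of the candidate."""
    l = len(candidate)
    return all(
        any(hamming(candidate, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in sequences
    )


def stem_to_regex(stem, wildcard="*"):
    """Compile a stem into a regular expression; each wildcard becomes '.'.
    The '*' wildcard symbol is a hypothetical convention for this sketch."""
    return re.compile("".join("." if c == wildcard else re.escape(c) for c in stem))


def expand_stem(stem, alphabet, wildcard="*"):
    """All concrete strings represented by a stem over the given alphabet."""
    choices = [alphabet if c == wildcard else c for c in stem]
    return {"".join(p) for p in product(*choices)}


# Toy protein-like input, invented for illustration.
sequences = ["ACDEF", "GCDEH", "ACDKF"]
print(is_motif("CDE", sequences, d=1))                # candidate within 1 mismatch everywhere
print(bool(stem_to_regex("C*E").fullmatch("CKE")))    # stem covers this string
print(sorted(expand_stem("C*E", "DK")))               # strings the stem represents
```

Each wildcard multiplies the number of strings a stem represents by the alphabet size, which is why redundant wildcards matter for large alphabets: a stem set with fewer wildcards represents a smaller superset of the motifs.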
