An improved voting algorithm for planted (l, d) motif search

The planted motif search problem is a classical problem in bioinformatics that seeks to identify meaningful patterns in biological sequences. Because the problem is NP-complete, current algorithms focus on improving the average time complexity and on solving challenging instances within an acceptable time. In this paper, we propose a new exact algorithm, CVoting, that improves on the state-of-the-art Voting algorithm. CVoting uses a new hash technique to reduce the space complexity to $O(mn + N(l,d))$ and a new pruning technique to reduce the average time complexity to $O\left(m^{2} n N(l,d) \left(\frac{1}{4} + \frac{3}{l}\right)^{l}\right)$. Experimental results show that CVoting outperforms competing algorithms, including PMS1, RISOTTO, Voting and PMSprune, in both space and time: on challenging instances it is up to an order of magnitude faster and uses less memory. The software is publicly available at http://staff.ustc.edu.cn/xuyun/motif.
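To make the voting idea concrete, the following is a minimal Python sketch of the baseline scheme that CVoting builds on, not of CVoting itself: the hash and pruning techniques described above are not reproduced here. It assumes the DNA alphabet, takes N(l, d) to be the number of l-mers within Hamming distance d of a fixed l-mer, and treats a candidate as a motif when it occurs with at most d mismatches in every input sequence. The function names (`neighborhood_size`, `neighbors`, `voting`) are hypothetical and chosen for illustration only.

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"  # assumption: DNA sequences


def neighborhood_size(l, d, sigma=4):
    """N(l, d): number of l-mers within Hamming distance d of a fixed l-mer."""
    return sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1))


def neighbors(lmer, d):
    """Yield every string within Hamming distance d of lmer, each exactly once."""
    for k in range(d + 1):
        for positions in combinations(range(len(lmer)), k):
            # At each chosen position, substitute one of the other letters.
            choices = [[c for c in ALPHABET if c != lmer[p]] for p in positions]
            for subs in product(*choices):
                chars = list(lmer)
                for p, c in zip(positions, subs):
                    chars[p] = c
                yield "".join(chars)


def voting(sequences, l, d):
    """Return all l-mers that receive a vote from every sequence."""
    votes = {}
    for seq in sequences:
        voted = set()  # a sequence votes at most once per candidate
        for i in range(len(seq) - l + 1):
            voted.update(neighbors(seq[i:i + l], d))
        for cand in voted:
            votes[cand] = votes.get(cand, 0) + 1
    return sorted(c for c, v in votes.items() if v == len(sequences))
```

For example, `voting(["ACGT", "ACGA", "TCGT"], 4, 1)` returns `["ACGT"]`, the only 4-mer within one mismatch of all three inputs. The dictionary in this sketch can grow to one entry per distinct neighbor, which illustrates why the space and time costs in the abstract scale with N(l, d); for the commonly studied (15, 4) instance, `neighborhood_size(15, 4)` is 123841.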
