An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

Motif discovery is the problem of finding local patterns or motifs from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5-20) and is a dramatic improvement relative to previously known bounds. We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of σ (typically 4) the trivial bound is $n! \approx ({\frac{n}{e}})^n, n={\sigma}^w$. Our bound is roughly n(σlogσn)n. We relate this theoretical result to the exact motif discovery program, TsukubaBB, whose algorithm contains ideas which inspired the result. We describe a recent improvement to the TsukubaBB program which can give a speed up of nine or more and use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.

[1]  Bin Ma,et al.  Finding similar regions in many strings , 1999, STOC '99.

[2]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[3]  Paul Horton Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Alignment of DNA and Protein Sequences , 2001, J. Comput. Biol..

[4]  Dimitrios I. Fotiadis,et al.  Greedy mixture learning for multiple motif discovery in biological sequences , 2003, Bioinform..

[5]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[6]  Z. Weng,et al.  Finding functional sequence elements by multiple local alignment. , 2004, Nucleic acids research.

[7]  Michael Q. Zhang,et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae , 1999, Bioinform..

[8]  Hiroki Arimura,et al.  On approximation algorithms for local multiple alignment , 2000, RECOMB '00.

[9]  Bin Ma,et al.  Finding Similar Regions in Many Sequences , 2002, J. Comput. Syst. Sci..

[10]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[11]  Gary D. Stormo,et al.  DNA binding sites: representation and discovery , 2000, Bioinform..

[12]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[13]  Gary D. Stormo,et al.  Identification of consensus patterns in unaligned DNA sequences known to be functionally related , 1990, Comput. Appl. Biosci..

[14]  P Horton A branch and bound algorithm for local multiple alignment. , 1996, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[15]  Paul Horton Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Sequence Alignment , 2000, CPM.