Paper information - An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

Motif discovery is the problem of finding local patterns, or motifs, in a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix-based motif discovery has been extensively studied, but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width $w$ (which is typically 5-20) and is a dramatic improvement relative to previously known bounds. We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length $w$, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of $\sigma$ (typically 4) the trivial bound is $n! \approx (\frac{n}{e})^n$, where $n = \sigma^w$. Our bound is roughly $n(\sigma \log_\sigma n)^n$. We relate this theoretical result to the exact motif discovery program TsukubaBB, whose algorithm contains ideas that inspired the result. We describe a recent improvement to the TsukubaBB program that can give a speedup of a factor of nine or more, and we use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.
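To get a feel for the gap between the two bounds quoted in the abstract, the following minimal sketch compares them on a log10 scale for $\sigma = 4$ and a few motif widths. It is not taken from the paper. It assumes the second bound reads $n(\sigma \log_\sigma n)^n$ (the exponent is reconstructed from flattened text) and ignores whatever constants "roughly" hides. The function names are hypothetical.

```python
import math


def log10_trivial_bound(sigma: int, w: int) -> float:
    """log10 of n!, with n = sigma**w, computed via the log-gamma function."""
    n = sigma ** w
    return math.lgamma(n + 1) / math.log(10)


def log10_abstract_bound(sigma: int, w: int) -> float:
    """log10 of n * (sigma * log_sigma(n))**n.

    Assumption: this is the intended reading of the bound in the abstract.
    Since n = sigma**w, log_sigma(n) equals w exactly.
    """
    n = sigma ** w
    return math.log10(n) + n * math.log10(sigma * w)


if __name__ == "__main__":
    sigma = 4  # DNA alphabet size, as in the abstract
    for w in (5, 10, 15, 20):
        trivial = log10_trivial_bound(sigma, w)
        improved = log10_abstract_bound(sigma, w)
        print(f"w={w:2d}  log10(n!)={trivial:.3e}  log10(bound)={improved:.3e}")
```

Working in logarithms keeps the values within floating-point range, since $n = 4^{20}$ already exceeds $10^{12}$ and both bounds are far too large to represent directly.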

Paul Horton | Wataru Fujibuchi
