Randomized Algorithms for Motif Detection

Motivation: Motif detection in DNA sequences has many important applications in biological studies, e.g., locating binding sites and regulatory signals and designing genetic probes. In this paper, we propose a randomized algorithm, design an improved EM algorithm, and combine the two in a software package.

Results: (1) We design a randomized algorithm for the consensus pattern problem. We show that, with high probability, the algorithm finds a pattern in polynomial time with cost error at most e × l for each string, where l is the length of the motif and e can be any positive number given by the user. (2) We design an improved EM (Expectation Maximization) algorithm that outperforms the original EM algorithm. (3) We develop a software package, MotifDetector, that uses our randomized algorithm to find good seeds and the improved EM algorithm to do local search. We compare MotifDetector with Buhler and Tompa's PROJECTION, which is considered to be the best known software for motif detection. Simulations show that MotifDetector is slower than PROJECTION when the pattern length is relatively small and outperforms PROJECTION when the pattern length becomes large.

Availability: Free from http://www.cs.cityu.edu.hk/~lwang/software/motif/index.html, subject to copyright restrictions.
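To make the seed-then-refine idea from the Results concrete, here is a minimal Python sketch. It is not the MotifDetector implementation and carries none of the paper's guarantees: seeds are the consensus of a few randomly sampled length-l windows, and the refinement is a simple hard-assignment local search standing in for EM. All function names and the parameters `trials`, `sample_size`, and `max_iters` are hypothetical, and every sequence is assumed to have length at least l.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_window(seq, pattern):
    """Return (distance, start) of the window of seq closest to pattern."""
    l = len(pattern)
    return min((hamming(seq[i:i + l], pattern), i)
               for i in range(len(seq) - l + 1))

def consensus(windows):
    """Column-wise majority string of equal-length windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))

def total_cost(seqs, pattern):
    """Consensus-pattern cost: sum over sequences of the best-window distance."""
    return sum(best_window(s, pattern)[0] for s in seqs)

def refine(seqs, pattern, max_iters=20):
    """Hard-assignment local search (a stand-in for EM, not EM itself)."""
    l = len(pattern)
    cost = total_cost(seqs, pattern)
    for _ in range(max_iters):
        starts = [best_window(s, pattern)[1] for s in seqs]
        candidate = consensus([s[i:i + l] for s, i in zip(seqs, starts)])
        candidate_cost = total_cost(seqs, candidate)
        if candidate_cost >= cost:  # stop as soon as there is no strict gain
            break
        pattern, cost = candidate, candidate_cost
    return pattern, cost

def find_motif(seqs, l, trials=200, sample_size=3, seed=0):
    """Try random seeds, refine each, and keep the lowest-cost pattern."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        sample = []
        for _ in range(sample_size):
            s = rng.choice(seqs)
            i = rng.randrange(len(s) - l + 1)
            sample.append(s[i:i + l])
        pattern, cost = refine(seqs, consensus(sample))
        if best is None or cost < best[1]:
            best = (pattern, cost)
    return best
```

Calling `find_motif(seqs, l)` returns a `(pattern, cost)` pair, where `cost` is the total Hamming distance from the pattern to its closest window in each sequence. Dividing that cost by the number of sequences gives an average per-string error that can be compared informally against a target of e × l.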
