Finding subtle motifs with variable gaps in unaligned DNA sequences

Biologists have established that gene expression is controlled and regulated primarily by relatively short sequences in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation, and base composition. Finding them is a fundamental problem in molecular biology with important applications. Although many approaches to signal finding (i.e. finding these short sequences) exist, recent studies show that the problem still leaves plenty of room for improvement. In 2000, Pevzner and Sze proposed the Challenge Problem of motif detection and reported that most motif-finding algorithms of the time were incapable of detecting the target motifs in it. In this paper, we show that, using an iterative-restart design, our new algorithm can correctly find the target motifs. Furthermore, because some transcription factors form dimers or even more complex structures, and because the transcription process can involve multiple factors separated by variable spacers, we extend the original problem to an even more challenging one by addressing combinatorial signals with gaps of variable length. To demonstrate the effectiveness of our algorithm, we tested it on a series of instances of the new challenge problem as well as on real regulons, and compared it with several representative motif-finding algorithms.
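To make the extended problem concrete, the following is a minimal sketch of how a test instance with a two-part signal and a variable-length gap could be generated. It is not the algorithm described in this paper, and it is not taken from the paper's experiments: the function names (`mutate`, `plant_gapped_instance`, `hamming`), the example motif halves, and all numeric parameters are hypothetical choices made for illustration only.

```python
import random

BASES = "ACGT"


def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions substituted."""
    out = list(motif)
    for i in rng.sample(range(len(motif)), d):
        # Pick a base different from the current one so the change is real.
        out[i] = rng.choice([b for b in BASES if b != out[i]])
    return "".join(out)


def plant_gapped_instance(n_seqs, seq_len, left, right, d, gap_range, seed=0):
    """Plant mutated copies of left + spacer + right in random sequences.

    Each half receives exactly d substitutions, and the spacer length is
    drawn uniformly from gap_range (inclusive) for every sequence.
    Returns the sequences and, per sequence, the (start, gap) that was used.
    """
    rng = random.Random(seed)
    seqs, positions = [], []
    for _ in range(n_seqs):
        gap = rng.randint(*gap_range)
        spacer = "".join(rng.choice(BASES) for _ in range(gap))
        site = mutate(left, d, rng) + spacer + mutate(right, d, rng)
        start = rng.randrange(seq_len - len(site) + 1)
        background = [rng.choice(BASES) for _ in range(seq_len)]
        background[start:start + len(site)] = site  # overwrite in place
        seqs.append("".join(background))
        positions.append((start, gap))
    return seqs, positions


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    seqs, positions = plant_gapped_instance(
        n_seqs=5, seq_len=60, left="ACGTACGT", right="TTGACCAA",
        d=2, gap_range=(3, 7), seed=1,
    )
    for s, (start, gap) in zip(seqs, positions):
        print(start, gap, s[start:start + 16 + gap])
```

A motif finder run on such an instance sees only the sequences; the recorded `(start, gap)` pairs serve as ground truth for checking whether both halves and the spacing between them were recovered.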
