cWINNOWER algorithm for finding fuzzy DNA motifs

The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern that differs from a motif of length l by at most d mutations. The algorithm finds such motifs when enough mutated copies of the motif (i.e., the signals) are present in the DNA sequence. cWINNOWER substantially improves on the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, which enables it to detect much weaker signals. We studied the minimum number of motif copies (signals) required for detection, q_c, as a function of sequence length N for random sequences. We found that q_c increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces q_c by a factor of three in this case, so the algorithm can detect motifs represented by roughly one third as many signals. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12000 for (l, d) = (15, 4).
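The signal definition above can be made concrete with a short sketch. It is not the authors' implementation: the function names (`hamming`, `is_signal`, `candidate_pairs`) are hypothetical, and the pairing rule relies only on the observation that two signals of the same (l, d) motif can differ from each other in at most 2d positions, which is the basis for linking words in a winnower-style graph. The consensus constraint and the sub-clique counting are not modeled here.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_signal(word: str, motif: str, d: int) -> bool:
    """True if `word` differs from `motif` by at most d mutations."""
    return len(word) == len(motif) and hamming(word, motif) <= d


def candidate_pairs(seq: str, l: int, d: int) -> list[tuple[int, int]]:
    """Start positions (i, j), i < j, of l-mers that differ in at most 2d places.

    Two signals of one motif are each within d of it, so by the triangle
    inequality they are within 2d of each other. Such pairs are the edges
    a clique-based search would start from (assumption: brute-force
    all-pairs scan, fine for short sequences only).
    """
    words = [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)]
    return [(i, j)
            for (i, u), (j, v) in combinations(words, 2)
            if hamming(u, v) <= 2 * d]


# Illustrative (8, 2) example with a made-up motif and two mutated copies.
motif = "ACGTACGT"
s1, s2 = "TCGTACGA", "ACCTACTT"
print(is_signal(s1, motif, 2), is_signal(s2, motif, 2), hamming(s1, s2))
```

The all-pairs scan is quadratic in sequence length, so this is only meant to illustrate the definitions; a practical search over N = 12000 would need the pruning that the clique-based methods provide.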

[1] Douglas L. Brutlag, et al. BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes, 2000, Pacific Symposium on Biocomputing.

[2] Jun S. Liu, et al. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments, 2002, Nature Biotechnology.

[3] Charles Elkan, et al. Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymers, 1994, ISMB.

[4] R. Jackson. Genomic regulatory systems, 2001.

[5] Michael Levine, et al. Genome-wide identification of tissue-specific enhancers in the Ciona tadpole, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[6] Pavel A. Pevzner, et al. Combinatorial Approaches to Finding Subtle Signals in DNA Sequences, 2000, ISMB.

[7] Gary D. Stormo, et al. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, 1999, Bioinform.

[8] Hao Li. Computational approaches to identifying transcription factor binding sites in yeast genome, 2002, Methods in enzymology.

[9] G. Church, et al. Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm, 2002, Journal of molecular biology.

[10] D. Botstein, et al. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF, 2001, Nature.

[11] Mikhail S. Gelfand, et al. Finding Weak Motifs in DNA Sequences, 2001, Pacific Symposium on Biocomputing.

[12] E. Davidson. Genomic Regulatory Systems: Development and Evolution, 2005.

[13] G. Church, et al. Systematic determination of genetic network architecture, 1999, Nature Genetics.

[14] H. Bussemaker, et al. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis, 2000, Proceedings of the National Academy of Sciences of the United States of America.

[15] John J. Wyrick, et al. Genome-wide location and function of DNA binding proteins, 2000, Science.

[16] Jun S. Liu, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, 1993, Science.