Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test

Background: This paper addresses the problem of recognising DNA cis-regulatory modules that are located far from genes. Experimental procedures for finding them are slow and costly, and computational prediction is hard because positional information is lacking.

Results: We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: an abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs per se. Though over-representation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, distinguishing regulatory from other genomic DNA remains a difficult problem.

Conclusion: We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. Potential applications of our method include annotation of new genomic sequences and motif discovery.
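To make the phrase "distributions of similar words" concrete, the following is a minimal sketch and not the fluffy-tail test itself, whose details the abstract does not give. It assumes that "words" are fixed-length k-mers and that two words are "similar" when they differ in at most a chosen number of positions; the function names, the word length `k`, and the `max_mismatches` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in the sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def similar_word_counts(seq, k=5, max_mismatches=1):
    """For each k-mer present in seq, sum the occurrences of all k-mers
    within max_mismatches of it (the word itself included).

    Assumption: this definition of similarity is illustrative only.
    """
    counts = kmer_counts(seq, k)
    return {
        word: sum(c for other, c in counts.items()
                  if hamming(word, other) <= max_mismatches)
        for word in counts
    }


if __name__ == "__main__":
    # Toy sequence, not real regulatory DNA.
    for word, n in sorted(similar_word_counts("ACGTACGTTACGA", k=4).items()):
        print(word, n)
```

Under these assumptions, one could compare how the resulting counts are distributed in a candidate sequence against the same counts in shuffled or background sequence; which statistic to compare is left open here because the abstract does not specify it.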
