Nonoverlapping clusters: approximate distribution and application to molecular biology.

An approach is developed for screening genomic sequence data to identify gene regulatory regions. It is based on deciding whether putative transcription factor binding sites are clustered together more tightly than one would expect by chance. Given n events occurring on an interval of width L (L base pairs), an r:w cluster is defined as r + 1 consecutive events all contained within a window of length wL. Accurate and easily computable approximations are derived for the distribution of the number of nonoverlapping r:w clusters under the model in which the positions of the n events are uniformly distributed. Simulations demonstrate that these approximations are more accurate than existing methods. The approximation is applied to detect erythroid-specific regulatory regions in genomic DNA sequences, first in an artificial case where r is specified a priori and then as part of an exploratory approach.
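
The r:w cluster definition can be made concrete with a small sketch. The Python below is illustrative only and is not the approximation derived in the paper: it counts nonoverlapping r:w clusters with a greedy left-to-right scan and estimates a tail probability by Monte Carlo simulation under the uniform model. The function names, the greedy counting rule, and the convention that nonoverlapping clusters share no events are assumptions made for this example.

```python
import random


def count_nonoverlapping_clusters(positions, r, w, L=1.0):
    """Greedy left-to-right count of nonoverlapping r:w clusters.

    An r:w cluster is r + 1 consecutive events inside a window of length w * L.
    Assumption: nonoverlapping clusters share no events, so after a cluster is
    found the scan resumes at the first event beyond it.
    """
    xs = sorted(positions)
    width = w * L
    count = 0
    i = 0
    while i + r < len(xs):
        if xs[i + r] - xs[i] <= width:
            count += 1
            i += r + 1  # skip all r + 1 events of the cluster just counted
        else:
            i += 1
    return count


def simulate_tail_probability(n, r, w, k, trials=2000, seed=0):
    """Monte Carlo estimate of P(at least k nonoverlapping r:w clusters)
    when n events are placed uniformly on the unit interval (L = 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        if count_nonoverlapping_clusters(xs, r, w) >= k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical binding-site positions, rescaled to the unit interval.
    sites = [0.10, 0.11, 0.12, 0.50, 0.90, 0.91, 0.92]
    print(count_nonoverlapping_clusters(sites, r=2, w=0.05))
    print(simulate_tail_probability(n=7, r=2, w=0.05, k=2))
```

A small simulated tail probability for the observed cluster count would suggest that the sites are clustered more than the uniform model predicts. Simulation of this kind is the brute-force alternative to the closed-form approximations described above, which are meant to avoid that cost.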
