Some statistical problems in the assessment of inhomogeneities of DNA sequence data

Abstract: The fields of molecular genetics and medicine are accumulating DNA and protein sequence data at an accelerating rate. Discovering and interpreting sequence patterns can contribute to understanding molecular mechanisms and evolutionary processes. This article considers two types of statistical problems in these contexts. (1) The first is identifying anomalies in the distribution of a specified biochemical marker along a DNA sequence. New statistical methods are set forth for assessing excessive clustering, overdispersion, and excessive regularity of the marker along the sequence, with applications to the physical map data of the bacterium Escherichia coli. (2) The second is the assembly of cloned DNA segments, for which some results and statistical problems are described. Sections 2 and 3 of the article present background material on DNA organization and inheritance.
