Applications of the Scan Statistic in DNA Sequence Analysis

Advances in biochemical techniques have made large databases of long DNA sequences available. These sequences are conglomerates of random and nonrandom letter strings over the nucleotide alphabet {A, C, G, T}. As the databases expand, mathematical methods play an increasingly important role in analyzing and interpreting the rapidly accumulating DNA data. In this chapter, we discuss a specific example: identifying nonrandom clusters of palindromes in a family of herpesvirus genomes using the r-scan statistic. Palindrome positions on the genome are modeled as i.i.d. random variables uniformly distributed on the unit interval (0,1). After comparing three Poisson-type approximations, we compute the r-scan distribution using a compound Poisson approximation proposed by Glaz (1994). Some of the significant palindrome clusters lie in genome regions containing origins of replication and regulatory signals of the herpesviruses.
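To make the r-scan idea concrete, here is a minimal sketch in Python. It assumes the usual definition of an r-scan as the distance spanned by r consecutive spacings, X(i+r) - X(i), over the ordered positions, so that an unusually small minimum r-scan signals a cluster. The function names, the toy positions, and the Monte Carlo significance estimate are illustrative assumptions; in particular, the simulation stands in for the compound Poisson approximation used in the chapter and is not that method.

```python
import random


def r_scans(positions, r):
    """Return the r-scans X(i+r) - X(i) of the sorted positions.

    Assumes positions are scaled to the unit interval (0, 1).
    """
    xs = sorted(positions)
    return [xs[i + r] - xs[i] for i in range(len(xs) - r)]


def min_r_scan(positions, r):
    """Smallest r-scan: the tightest stretch containing r + 1 points."""
    return min(r_scans(positions, r))


def monte_carlo_p_value(observed_min, n, r, trials=2000, seed=0):
    """Estimate P(min r-scan <= observed_min) for n i.i.d. Uniform(0,1) points.

    Plain simulation under the uniform null model; a stand-in for an
    analytic (e.g. Poisson-type) approximation, not a replacement for it.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if min_r_scan(sample, r) <= observed_min:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical palindrome positions (made up for illustration),
    # already divided by the genome length.
    toy_positions = [0.05, 0.30, 0.31, 0.32, 0.33, 0.70, 0.95]
    r = 3
    observed = min_r_scan(toy_positions, r)
    p = monte_carlo_p_value(observed, n=len(toy_positions), r=r)
    print("minimum r-scan:", observed)
    print("estimated p-value under the uniform model:", p)
```

Simulation is practical for small examples like this one, but for genome-scale counts and very small tail probabilities an analytic approximation of the kind compared in the chapter is usually the more efficient choice.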

[1]  Joseph Glaz,et al.  Approximations and Bounds for the Distribution of the Scan Statistic , 1989 .

[2]  Mark Berman,et al.  A Useful Upper Bound for the Tail Probabilities of the Scan Statistic When the Sample Size is Large , 1985 .

[3]  R. Doolittle Molecular evolution: computer analysis of protein and nucleic acid sequences. , 1990, Methods in enzymology.

[4]  S Karlin,et al.  Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[5]  S. Karlin,et al.  A second course in stochastic processes , 1981 .

[6]  S. Karlin,et al.  Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. , 1996, Nucleic acids research.

[7]  Compound poisson approximations for the numbers of extreme spacings , 1993, Advances in Applied Probability.

[8]  D. Aldous Probability Approximations via the Poisson Clumping Heuristic , 1988 .

[9]  Joseph Naus,et al.  Poisson approximations for the distribution and moments of ordered m-spacings , 1994 .

[10]  B E Griffin,et al.  Epstein-Barr virus in epithelial cell tumors: a breast cancer study. , 1995, Cancer research.

[11]  S Karlin,et al.  An efficient algorithm for identifying matches with errors in multiple long molecular sequences. , 1991, Journal of molecular biology.

[12]  M. Waterman Mathematical Methods for DNA Sequences , 1989 .

[13]  Noel A Cressie,et al.  The minimum of higher order gaps , 1977 .

[14]  Pattern matching between two non-aligned random sequences , 1994 .

[15]  A. Barbour,et al.  Poisson Approximation , 1992 .

[16]  J. Glaz Approximations for tail probabilities and moments of the scan statistic , 1992 .

[17]  K. Weston,et al.  An enhancer element in the short unique region of human cytomegalovirus regulates the production of a group of abundant immediate early transcripts. , 1988, Virology.

[18]  Lawrence Corey,et al.  138 – Herpes Simplex Virus , 2015 .

[19]  Louis H. Y. Chen Poisson Approximation for Dependent Trials , 1975 .

[20]  Joseph Naus,et al.  Tight Bounds and Approximations for Scan Statistic Probabilities for Discrete Data , 1991 .

[21]  Amir Dembo,et al.  Poisson Approximations for r-Scan Processes , 1992 .

[22]  L. Gordon,et al.  Two moments suffice for Poisson approximations: the Chen-Stein method , 1989 .

[23]  J. Naus,et al.  Screening for unusual matched segments in multiple protein sequences , 1996 .

[24]  L. Gordon,et al.  [Poisson Approximation and the Chen-Stein Method]: Rejoinder , 1990 .

[25]  Michael S. Waterman,et al.  Introduction to computational biology , 1995 .

[26]  Chien-Tai Lin,et al.  Approximating the Distribution of the Scan Statistic Using Moments of the Number of Clumps , 1997 .

[27]  S. Karlin,et al.  Chance and statistical significance in protein and DNA sequence analysis. , 1992, Science.

[28]  S Karlin,et al.  Computational DNA sequence analysis. , 1994, Annual review of microbiology.

[29]  Chien-Tai Lin,et al.  Computing the exact distribution of the extremes of sums of consecutive spacings , 1997 .

[30]  S Karlin,et al.  Assessments of DNA inhomogeneities in yeast chromosome III. , 1993, Nucleic acids research.

[31]  Terence P. Speed,et al.  Over- and Underrepresentation of Short DNA Words in Herpesvirus Genomes , 1996, J. Comput. Biol..

[32]  S Karlin,et al.  Compositional biases of bacterial genomes and evolutionary implications , 1997, Journal of bacteriology.

[33]  L. Gordon,et al.  Poisson Approximation and the Chen-Stein Method , 1990 .