Nonrandom Clusters of Palindromes in Herpesvirus Genomes

Palindromes are DNA words that are symmetrical in the sense that they read exactly the same as their reverse complements. When the occurrences of palindromes in a DNA molecule are represented as points on the unit interval, scan statistics can be used to identify regions with an unusually high concentration of palindromes. Previous studies have associated these regions with the replication origins of a few herpesviruses. However, the use of scan statistics requires the assumption that the points representing the palindromes are independently and uniformly distributed on the unit interval. In this paper, we provide a mathematical basis for this assumption by showing that, in randomly generated DNA sequences, the occurrences of palindromes can be approximated by a Poisson process. We obtain an easily computable upper bound on the Wasserstein distance between the palindrome process and the Poisson process. This bound is then used as a guide for choosing an optimal palindrome length in the analysis of a collection of 16 herpesvirus genomes. Regions harboring significant palindrome clusters are identified and compared with the known locations of replication origins. The analysis suggests a few extensions of the scan statistics that can help formulate an algorithm for more accurate prediction of replication origins.
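
To make the two central notions concrete, the following minimal Python sketch locates reverse-complement palindromes in a sequence and computes a simple scan statistic, taken here to be the largest number of palindrome positions falling in any window of a given width. It is an illustration under stated assumptions, not the procedure used in the paper: the function names are hypothetical, every window of one fixed even length is counted as a separate palindrome, and positions are left in base coordinates (dividing by the sequence length would map them to the unit interval).

```python
# Hypothetical helper names; a sketch, not the authors' implementation.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindrome(word):
    """Return True if the word equals its own reverse complement."""
    # Pair the i-th base with the i-th base from the end; each pair must be
    # complementary. Unknown letters (e.g. N) never match.
    return all(COMPLEMENT.get(a) == b for a, b in zip(word, reversed(word)))


def palindrome_positions(seq, length):
    """Start indices of all windows of the given length that are palindromes."""
    return [
        i
        for i in range(len(seq) - length + 1)
        if is_palindrome(seq[i:i + length])
    ]


def scan_statistic(points, window):
    """Largest number of points lying in any interval of width `window`."""
    points = sorted(points)
    best, lo = 0, 0
    for hi, p in enumerate(points):
        # Advance the left end until the interval [points[lo], p] fits.
        while p - points[lo] > window:
            lo += 1
        best = max(best, hi - lo + 1)
    return best


if __name__ == "__main__":
    seq = "AAGAATTCAA"  # toy sequence containing the palindrome GAATTC
    hits = palindrome_positions(seq, 6)
    print(hits, scan_statistic(hits, window=5))
```

Assessing whether an observed cluster is significant would additionally require the distribution of the scan statistic under the uniform model, which this sketch does not attempt.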
