Repeats and correlations in human DNA sequences.

We study the nucleotide-nucleotide mutual information function I(k) of the DNA sequences of the three completely sequenced human chromosomes 20, 21, and 22. We find in each human chromosome (i) the absence of the k=3 base pair (bp) sequence periodicity characteristic of protein coding regions, (ii) the absence of the k=10-11 bp sequence periodicity characteristic of both protein secondary structure and DNA bendability, and (iii) the presence of significant statistical dependencies at about k=135 bp and at about k=165 bp. We investigate to what degree the density and composition of interspersed repeats might explain these observed statistical patterns in all three human chromosomes. We use simple stochastic models to substitute known interspersed repeats and find by numerical studies that (iv) the presence of interspersed repeats dominates short-range correlations as measured by I(k) on the scale of several hundred base pairs in human chromosomes 20, 21, and 22. On the other hand, we find that (v) interspersed repeats contribute only weakly to long-range correlations due to the clustering of highly abundant Alu repeats.
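
For readers unfamiliar with I(k), the following is a minimal sketch of a plug-in estimate of the mutual information between nucleotides separated by k bp, using the standard definition I(k) = sum over (a, b) of p_ab(k) log2[p_ab(k) / (p_a p_b)]. The function name `mutual_information` and the choice to skip non-ACGT symbols are assumptions made for illustration. This is not the authors' implementation, and it applies no finite-size correction.

```python
import math
from collections import Counter

def mutual_information(seq, k):
    """Plug-in estimate of I(k) in bits for a DNA string (illustrative sketch).

    Counts pairs (seq[i], seq[i + k]), skipping any pair that contains a
    symbol other than A, C, G, T, and uses the marginals of those pair
    counts as the single-nucleotide probabilities.
    """
    seq = seq.upper()
    pairs = Counter()
    for i in range(len(seq) - k):
        a, b = seq[i], seq[i + k]
        if a in "ACGT" and b in "ACGT":
            pairs[(a, b)] += 1

    n = sum(pairs.values())
    if n == 0:
        return 0.0

    # Marginal counts of the first and second nucleotide of each pair.
    first = Counter()
    second = Counter()
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count

    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((first[a] / n) * (second[b] / n)))
    return mi

# A perfectly 3-periodic toy sequence: knowing one base fixes the base 3 bp away.
toy = "ACG" * 100
profile = [mutual_information(toy, k) for k in range(1, 7)]
```

Because the plug-in estimate is biased upward for finite sequences, values computed this way on real chromosomes would need to be compared against a suitable null expectation before being called significant.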
