Informational structure of two closely related eukaryotic genomes.

Attempts to identify a species from its DNA sequence on purely statistical grounds have been made for more than a decade. The most prominent of these genome signatures relies on neighborhood correlations (i.e., dinucleotide frequencies) and consequently attributes species identification to mechanisms operating at the dinucleotide level (e.g., neighbor-dependent mutations). Taking Mus musculus and Rattus norvegicus as examples, we analyze short- and intermediate-range statistical correlations in DNA sequences. These correlation profiles are computed for all chromosomes of the two species. We find that as the range of correlations increases, the capacity to distinguish between the species on the basis of the correlation profile improves, and ever shorter sequence segments suffice for a full species separation. This finding suggests that the distinctive traits within the sequence lie beyond the level of a few nucleotides. The large-scale statistical patterning of DNA sequences on which such genome signatures are based is thus substantially determined by mobile elements (e.g., transposons and retrotransposons). The study and interspecies comparison of such correlation profiles can therefore reveal features of retrotransposition, segmental duplications, and other processes of genome evolution.
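To make the notion of a range-dependent correlation profile concrete, the following is a minimal sketch in Python. It assumes the mutual information between nucleotides k positions apart as the correlation measure; the abstract does not state which measure the study uses, so this choice and the function names `mutual_information` and `correlation_profile` are illustrative assumptions rather than the authors' method.

```python
from collections import Counter
from math import log2


def mutual_information(seq, k):
    """Mutual information (in bits) between nucleotides k positions apart.

    Assumption: this is one common way to quantify correlations at range k,
    not necessarily the measure used in the study.
    """
    seq = seq.upper()
    # Keep only pairs in which both symbols are unambiguous nucleotides.
    pairs = [(a, b) for a, b in zip(seq, seq[k:]) if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written in terms of raw counts.
        mi += p_ab * log2(p_ab * n * n / (left[a] * right[b]))
    return mi


def correlation_profile(seq, max_range):
    """Hypothetical helper: mutual information for ranges 1..max_range."""
    return [mutual_information(seq, k) for k in range(1, max_range + 1)]
```

With k = 1 this measure depends only on dinucleotide frequencies, while larger k probes the intermediate-range structure discussed above. Profiles computed for segments of each chromosome could then be compared between species, for instance with a simple distance between the profile vectors.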
