Divergence and Shannon information in genomes.

Shannon information (SI) and its special case, divergence, are defined for a DNA sequence in terms of the probabilities of chemical words in the sequence and are computed for a set of complete genomes highly diverse in length and composition. We find the following: SI (but not divergence) is inversely proportional to sequence length for a random sequence but is length-independent for genomes; the genomic SI is always greater than, and for shorter words and longer sequences hundreds to thousands of times greater than, the SI of a random sequence whose length and composition match those of the genome; and genomic SIs appear to have word-length-dependent universal values. The universality is inferred to be an evolutionary footprint of a universal mode of genome growth.
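The abstract does not give the formulas for SI or divergence, so the following is only a minimal sketch of how word-probability measures of this kind can be computed. It uses two standard textbook quantities as stand-ins: the Shannon entropy of the overlapping k-letter word frequencies, and the Kullback-Leibler divergence of those frequencies from a uniform word distribution. These may differ from the definitions used in the paper, and all function names (`word_counts`, `shannon_entropy`, `information_vs_uniform`, `random_sequence`) are hypothetical.

```python
import math
import random
from collections import Counter


def word_counts(seq, k):
    """Count overlapping k-letter words in seq (sliding window, step 1)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_entropy(seq, k):
    """Shannon entropy H = -sum(p * ln p), in nats, of the k-word frequencies."""
    counts = word_counts(seq, k)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def information_vs_uniform(seq, k, alphabet_size=4):
    """Assumed stand-in measure: ln(alphabet_size**k) - H.

    This equals the Kullback-Leibler divergence of the observed k-word
    frequencies from a uniform distribution over all possible k-words.
    It is not necessarily the SI or divergence defined in the paper.
    """
    return k * math.log(alphabet_size) - shannon_entropy(seq, k)


def random_sequence(length, seed=0):
    """Uniformly random DNA sequence, used as a baseline (assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        seq = random_sequence(n)
        print(n, information_vs_uniform(seq, k=2))
```

For a uniformly random sequence this stand-in quantity shrinks as the sequence gets longer, because the observed word frequencies approach the uniform distribution. Comparing a real genome against a random sequence of matching length and composition would additionally require a baseline generator that preserves base composition, which this sketch omits.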
