We measured the local compositional complexity (LCC) of DNA sequences by calculating Shannon information content over mononucleotide frequencies. Eukaryotic DNA appeared to be "simpler" than bacterial DNA even at the level of short oligonucleotides. Moreover, different functional domains of DNA displayed systematically different compositional complexity. In particular, exon sequences were systematically more complex than the corresponding introns. We present examples of complexity charts (plots of complexity versus position in sequence) for pre-mRNA sequences from higher eukaryotes. With a window width of 100 nucleotides and a window step of 1 nucleotide, introns could be distinguished from exons in the majority of cases studied. Complexity charts of immunoglobulin variable regions also allowed correct mapping of exons and introns in these sequences, a task that was impossible with the commercial programs available to date.
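As an illustration of how a complexity chart might be computed, the following is a minimal sketch rather than the authors' implementation. It assumes that LCC for a window is the Shannon entropy, in bits, of the mononucleotide frequencies observed in that window; the abstract does not give the exact formula or normalization. The function names `shannon_entropy` and `complexity_chart` are hypothetical, and the defaults mirror the window width of 100 and step of 1 stated above.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the symbol frequencies in `window`.

    Assumption: this stands in for the paper's LCC measure, whose exact
    definition is not given in the abstract.
    """
    n = len(window)
    if n == 0:
        return 0.0
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def complexity_chart(seq: str, width: int = 100, step: int = 1):
    """Return (start_position, entropy) pairs for each sliding window.

    Windows that would run past the end of the sequence are skipped, so a
    sequence shorter than `width` yields an empty chart.
    """
    seq = seq.upper()
    return [
        (start, shannon_entropy(seq[start:start + width]))
        for start in range(0, len(seq) - width + 1, step)
    ]


if __name__ == "__main__":
    # Toy input: a low-complexity run followed by a mixed region.
    toy = "A" * 150 + "ACGT" * 50
    chart = complexity_chart(toy)
    print(chart[0], chart[-1])
```

Under this definition the values range from 0 bits (a window containing a single base) to 2 bits (all four bases equally frequent). Any character other than A, C, G, or T, such as N, is counted as an additional symbol in this sketch; real sequence data would likely need such positions filtered or masked first.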