Alternate measure of information useful for DNA sequences.

We propose an alternate measure of information, called superinformation, for analyzing the coding and noncoding regions of DNA. Superinformation is a measure of the "randomness of randomness." The method requires no prior training, and it distinguishes the coding from the noncoding regions of human DNA with higher accuracy than previously reported techniques. Superinformation can also be used to analyze the untranslated regions of various genes.
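The abstract does not state how superinformation is computed. As a minimal sketch of one possible reading of "randomness of randomness," the Python below splits a sequence into fixed-size blocks, computes the Shannon entropy of each block, and then takes the Shannon entropy of the resulting distribution of block entropies. The function names, the `block_size` and `decimals` parameters, and the rounding step used to bin the block entropies are all assumptions made for illustration. They are not taken from the paper, and the sketch does not reproduce the authors' classification procedure.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Written as p * log2(1/p) so that a single-symbol input yields 0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def superinformation(sequence, block_size=8, decimals=3):
    """Entropy of the distribution of per-block entropies (illustrative only).

    `block_size` and `decimals` are hypothetical parameters: the sequence is
    cut into non-overlapping blocks of `block_size` symbols, and each block
    entropy is rounded to `decimals` places so that equal values can be counted.
    """
    # Non-overlapping blocks; a trailing partial block is dropped.
    blocks = [
        sequence[i:i + block_size]
        for i in range(0, len(sequence) - block_size + 1, block_size)
    ]
    if not blocks:
        raise ValueError("sequence is shorter than block_size")
    block_entropies = [round(shannon_entropy(b), decimals) for b in blocks]
    return shannon_entropy(block_entropies)


if __name__ == "__main__":
    # Every block has the same entropy, so the entropy of entropies is zero.
    print(superinformation("ACGT" * 4, block_size=4))
    # One block with entropy 0 and one with entropy 2 bits: two equally
    # likely entropy values, so the result is 1 bit.
    print(superinformation("AAAA" + "ACGT", block_size=4))
```

Under this reading, a sequence whose blocks all look equally random scores low, while a sequence whose local randomness varies from block to block scores high. Whether this matches the quantity used in the paper would need to be checked against its full text.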
