Rényi continuous entropy of DNA sequences.

Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Rényi's quadratic entropy, calculated with the Parzen window density estimation method applied to chaos game representation/universal sequence map (CGR/USM) continuous maps of DNA sequences, constitutes a novel technique for evaluating global sequence randomness without some of the drawbacks of the former method. The asymptotic behaviour of this new measure was deduced analytically, and entropies were calculated for several synthetic and experimental biological sequences. The results were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences showed different p-values depending on the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provides additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
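The idea summarized above can be illustrated with a minimal sketch. It is written in Python rather than the authors' MATLAB and is not their implementation. It assumes a common CGR corner assignment (A, C, G, T on the corners of the unit square), an isotropic Gaussian Parzen kernel of width `sigma`, and the standard closed form for the quadratic entropy of a Gaussian Parzen estimate, H2 = -ln((1/N^2) sum_i sum_j G(x_i - x_j; 2 sigma^2 I)). The names `cgr_points` and `renyi_quadratic_entropy` are hypothetical and chosen for illustration only.

```python
import numpy as np

# Assumed corner assignment for the CGR unit square (one common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Map a DNA string to CGR coordinates in the unit square.

    Each point is the midpoint between the previous point and the corner
    assigned to the current nucleotide.
    """
    pts = np.empty((len(seq), 2))
    pos = np.array(start, dtype=float)
    for i, base in enumerate(seq):
        pos = (pos + np.array(CORNERS[base])) / 2.0
        pts[i] = pos
    return pts


def renyi_quadratic_entropy(pts, sigma):
    """Renyi quadratic entropy of a Gaussian Parzen density estimate.

    Uses the pairwise-kernel form: the integral of the squared Parzen
    estimate equals the mean of Gaussians with variance 2*sigma^2
    evaluated at all pairwise differences. Memory cost is O(N^2).
    """
    n, d = pts.shape
    diff = pts[:, None, :] - pts[None, :, :]      # (N, N, d) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)          # (N, N) squared distances
    var = 2.0 * sigma ** 2                        # variance of the convolved kernel
    kernel = np.exp(-sq_dist / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    return -np.log(kernel.mean())


# Usage: compare a fully repetitive sequence with a pseudo-random one.
rng = np.random.default_rng(0)
random_seq = "".join(rng.choice(list("ACGT"), size=200))
repeat_seq = "A" * 200
h_random = renyi_quadratic_entropy(cgr_points(random_seq), sigma=0.1)
h_repeat = renyi_quadratic_entropy(cgr_points(repeat_seq), sigma=0.1)
```

Under these assumptions the repetitive sequence collapses towards a single corner of the map and receives a lower entropy than the pseudo-random one. Varying `sigma` changes the resolution at which patterns are detected, which corresponds to the kernel-resolution dependence mentioned in the abstract.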
