Exploring Three-Base Periodicity for DNA Compression and Modeling

To explore the three-base periodicity often found in protein-coding DNA regions, we introduce a DNA model based on three deterministic states, where each state implements a finite-context model. The results show compression gains relative to the single finite-context model counterpart. Additionally, and potentially more interesting than the compression gain itself, the entropy associated with each of the three states differs, and this variation is not the same among the organisms analyzed.
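The idea of three deterministic states, each with its own finite-context model, can be illustrated with a minimal sketch. The abstract does not specify the details, so everything below is an assumption made for illustration: the state is taken to be the sequence position modulo 3, each state keeps its own order-`order` context counts, probabilities use a simple additive estimator with a hypothetical parameter `alpha`, and the function name `code_length_bits` is invented here. Setting `n_states=1` gives a single finite-context model for comparison.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def code_length_bits(seq, order=2, n_states=3, alpha=1.0):
    """Estimate the adaptive code length of `seq` in bits.

    Assumption: the active state is the position index modulo `n_states`,
    and each state owns a separate order-`order` finite-context model.
    Returns (total_bits, bits_per_state, symbols_per_state).
    """
    # One table of context -> symbol counts per state.
    counts = [defaultdict(lambda: [0] * len(ALPHABET)) for _ in range(n_states)]
    bits = [0.0] * n_states
    symbols = [0] * n_states

    for i in range(order, len(seq)):
        state = i % n_states              # deterministic state cycling
        context = seq[i - order:i]        # the `order` preceding bases
        sym = ALPHABET.index(seq[i])
        ctx_counts = counts[state][context]

        # Additive (Laplace-style) estimate from the counts seen so far.
        p = (ctx_counts[sym] + alpha) / (sum(ctx_counts) + alpha * len(ALPHABET))
        bits[state] += -math.log2(p)
        symbols[state] += 1

        # Update the model only after the symbol has been "coded".
        ctx_counts[sym] += 1

    return sum(bits), bits, symbols


if __name__ == "__main__":
    toy = "ATGGCCAAGCTGACC" * 40  # toy input, not real genomic data
    total3, per_state, n = code_length_bits(toy, order=2, n_states=3)
    total1, _, _ = code_length_bits(toy, order=2, n_states=1)
    print("three-state bits:", total3, "single-model bits:", total1)
    print("bits/base per state:", [b / c for b, c in zip(per_state, n)])
```

Dividing each state's accumulated bits by its symbol count gives a per-state bits-per-base figure, which is one way to look at the per-state entropy differences mentioned above. On real sequences the reading frame is generally unknown and non-coding regions are interleaved, so this sketch should not be expected to reproduce the reported gains.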
