Finite-Context Models for DNA Coding

Usually, the purpose of studying data compression algorithms is twofold. The need for efficient storage and transmission is often the main motivation, but underlying every compression technique there is a model that tries to reproduce, as closely as possible, the information source to be compressed. This model may be interesting in its own right, as it can shed light on the statistical properties of the source. DNA data are no exception. There is an urgent need for efficient methods able to reduce the storage space taken by the impressive amount of genomic data that are continuously being generated. Nevertheless, we also want to know how the code of life works and how it is structured. Creating good (compression) models for DNA is one of the ways to achieve these goals.
Recently, with the completion of the human genome sequencing, the development of efficient lossless compression methods for DNA sequences has gained considerable interest (Behzadi and Le Fessant, 2005; Cao et al., 2007; Chen et al., 2001; Grumbach and Tahi, 1993; Korodi and Tabus, 2005; 2007; Manzini and Rastero, 2004; Matsumoto et al., 2000; Pinho et al., 2006; 2008; 2009; Rivals et al., 1996). For example, the human genome comprises approximately 3 000 million base pairs (Rowen et al., 1997), whereas the genome of wheat has about 16 000 million (Dennis and Surridge, 2000). Since DNA is based on an alphabet of four different symbols (usually known as nucleotides or bases), namely, Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), without compression it takes approximately 750 MBytes to store the human genome (using log2 4 = 2 bits per symbol) and 4 GBytes to store the genome of wheat.
In this chapter, we address the problem of DNA data modeling and coding. We review the main approaches proposed in the literature over the last fifteen years and present some recent advances attained with finite-context models (Pinho et al., 2006; 2008; 2009).
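
The uncompressed sizes quoted above follow directly from the baseline of 2 bits per base. The Python sketch below reproduces that arithmetic and shows one possible way to pack a sequence at 2 bits per base. The names `CODES`, `raw_size_bytes`, and `pack`, as well as the assignment A=0, C=1, G=2, T=3, are illustrative assumptions and are not taken from any of the cited methods.

```python
# Illustrative sketch only: names and the base-to-code mapping are assumptions.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def raw_size_bytes(num_bases: int) -> float:
    """Storage needed at log2(4) = 2 bits per base, in bytes."""
    return num_bases * 2 / 8


def pack(seq: str) -> bytes:
    """Pack a sequence over {A, C, G, T} into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a final chunk that holds fewer than 4 bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


print(raw_size_bytes(3_000_000_000) / 1e6)   # human genome, in MBytes
print(raw_size_bytes(16_000_000_000) / 1e9)  # wheat genome, in GBytes
```

Any lossless DNA compressor is only useful if it averages fewer than these 2 bits per base.
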
Low-order finite-context models have been used for DNA compression as a secondary, fallback method. However, we have shown that models of orders higher than four are indeed able to attain significant compression performance. Initially, we proposed a three-state finite-context model for DNA protein-coding regions, i.e., for the parts of the DNA that carry information on how proteins are synthesized (Ferreira et al., 2006; Pinho et al., 2006). This three-state model proved to be better than a single-state model, giving additional evidence of a phenomenon that is common in these protein-coding regions: a periodicity of period three.
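
To make the idea concrete, the following Python sketch estimates how many bits per base an adaptive finite-context model would need. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fcm_bits_per_base`, the additive-smoothing parameter `alpha`, the initial context of all `A`s, and the choice of selecting one of `states` count tables by position modulo `states` are all hypothetical. Instead of running an arithmetic coder, it sums the ideal code length, -log2 of the probability assigned to each symbol.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
INDEX = {s: i for i, s in enumerate(ALPHABET)}


def fcm_bits_per_base(seq: str, order: int, states: int = 1, alpha: float = 1.0) -> float:
    """Average ideal code length (bits/base) of an adaptive order-`order` model.

    With states > 1, a separate table of counts is kept for each value of
    (position mod states); this is an assumed reading of a multi-state model.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    # One count vector per (state, context) pair, created on first use.
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    context = "A" * order  # assumed initial context
    total_bits = 0.0
    for pos, sym in enumerate(seq):
        c = counts[(pos % states, context)]
        k = INDEX[sym]
        # Smoothed probability estimate from the counts seen so far.
        p = (c[k] + alpha) / (sum(c) + alpha * len(ALPHABET))
        total_bits -= math.log2(p)
        c[k] += 1  # update only after coding, so a decoder could stay in sync
        if order > 0:
            context = (context + sym)[-order:]
    return total_bits / len(seq)


toy = "ACG" * 200  # artificial sequence with exact period three
print(fcm_bits_per_base(toy, order=0, states=1))
print(fcm_bits_per_base(toy, order=0, states=3))
```

On this artificial period-three sequence the three-state variant needs far fewer bits per base than the single-state one at order 0. Real protein-coding regions are far less regular, so any gain there would be smaller.
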

[1]  Giovanni Manzini, et al.  A simple and fast DNA compressor , 2004, Softw. Pract. Exp..

[2]  H. Jeffreys An invariant form for the prior probability in estimation problems , 1946, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[3]  C Dennis, et al.  A. thaliana genome , 2000, Nature.

[4]  Ioan Tabus,et al.  Normalized maximum likelihood model of order-1 for the compression of DNA sequences , 2007, 2007 Data Compression Conference (DCC'07).

[5]  Xin Chen,et al.  A compression algorithm for DNA sequences and its applications in genome comparison , 2000, RECOMB '00.

[6]  Ioan Tabus,et al.  An efficient normalized maximum likelihood algorithm for DNA sequence compression , 2005, TOIS.

[7]  Armando J. Pinho,et al.  Inverted-repeats-aware finite-context models for DNA coding , 2008, 2008 16th European Signal Processing Conference.

[8]  Stéphane Grumbach,et al.  Compression of DNA sequences , 1993, [Proceedings] DCC `93: Data Compression Conference.

[9]  Behshad Behzadi,et al.  DNA Compression Challenge Revisited: A Dynamic Programming Approach , 2005, CPM.

[10]  Armando J. Pinho,et al.  DNA coding using finite-context models and arithmetic coding , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  G. Mahairas,et al.  Sequencing the human genome. , 1997, Science.

[12]  Raphail E. Krichevsky,et al.  The performance of universal encoding , 1981, IEEE Trans. Inf. Theory.

[13]  Trevor I. Dix,et al.  Comparative analysis of long DNA sequences by per element information content using different contexts , 2007, BMC Bioinformatics.

[14]  David Salomon,et al.  Data Compression: The Complete Reference , 2006 .

[15]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.

[16]  Jean-Paul Delahaye,et al.  A guaranteed compression scheme for repetitive DNA sequences , 1996, Proceedings of Data Compression Conference - DCC '96.

[17]  Toshiko Matsumoto,et al.  Biological sequence compression algorithms. , 2000, Genome informatics. Workshop on Genome Informatics.

[18]  G. Blelloch Introduction to Data Compression , 2022 .

[19]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[20]  Trevor I. Dix,et al.  A Simple Statistical Algorithm for Biological Sequence Compression , 2007, 2007 Data Compression Conference (DCC'07).

[21]  Ioan Tabus,et al.  DNA sequence compression using the normalized maximum likelihood model for discrete regression , 2003, Data Compression Conference, 2003. Proceedings. DCC 2003.

[22]  Armando J. Pinho,et al.  A Three-State Model for DNA Protein-Coding Regions , 2006, IEEE Transactions on Biomedical Engineering.

[23]  David Salomon,et al.  Data compression - The Complete Reference, 4th Edition , 2004 .

[24]  Armando J. Pinho,et al.  Exploring Three-Base Periodicity for DNA Compression and Modeling , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[25]  Behshad Behzadi,et al.  DNA Compression Challenge Revisited , 2005 .

[26]  Xin Chen,et al.  A compression algorithm for DNA sequences , 2001, IEEE Engineering in Medicine and Biology Magazine.

[27]  Stéphane Grumbach,et al.  A New Challenge for Compression Algorithms: Genetic Sequences , 1994, Inf. Process. Manag..

[28]  Bin Ma,et al.  DNACompress: fast and effective DNA sequence compression , 2002, Bioinform..