Bacteria DNA sequence compression using a mixture of finite-context models

The ability of finite-context models to compress DNA sequences has been demonstrated in several recent works. In this paper, we explore this line further and propose a compression method based on eight finite-context models, with orders from two to sixteen, whose probabilities are averaged using weights calculated through a recursive procedure. The method was tested on a total of 2,338 sequences belonging to bacterial genomes, with sizes ranging from 1,286 to 13,033,779 bases. It achieved better compression than the state-of-the-art XM DNA coding algorithm and also ran faster.
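
The mixing idea can be illustrated with a minimal Python sketch. The abstract does not give the details, so several parts are assumptions made only for illustration: the even orders 2, 4, ..., 16, the additive smoothing parameter `alpha`, the forgetting factor `gamma`, and the particular weight recursion (each weight is raised to `gamma`, multiplied by the probability its model assigned to the symbol just seen, and renormalized). The names `FiniteContextModel` and `estimate_bits` are hypothetical. The sketch only estimates the ideal code length in bits per base; it does not include the arithmetic coder a real compressor would need.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # symbols outside this alphabet raise ValueError


class FiniteContextModel:
    """Order-k model: counts of the next symbol given the previous k symbols."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing (assumed value)
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def probs(self, history):
        # Use .get so that a lookup does not create an empty table entry.
        c = self.counts.get(history[-self.order:], (0, 0, 0, 0))
        total = sum(c) + 4 * self.alpha
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[history[-self.order:]][ALPHABET.index(symbol)] += 1


def estimate_bits(seq, orders=(2, 4, 6, 8, 10, 12, 14, 16), gamma=0.9):
    """Average ideal code length (bits per base) of seq under the mixture."""
    if not seq:
        return 0.0
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, sym in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        s = ALPHABET.index(sym)
        dists = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred.
        p = sum(w * d[s] for w, d in zip(weights, dists))
        total_bits -= math.log2(p)
        # Assumed recursive update: reward models that predicted sym well.
        weights = [(w ** gamma) * d[s] for w, d in zip(weights, dists)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, sym)
    return total_bits / len(seq)


if __name__ == "__main__":
    print(estimate_bits("ACGT" * 200))
```

On a highly repetitive input such as the one in the usage line, the estimate falls well below the 2 bits per base of a uniform code, because the models quickly learn that each context has a single likely successor.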
