A New Challenge for Compression Algorithms: Genetic Sequences

Universal data compression algorithms fail to compress genetic sequences. This failure is due to the specificity of this particular kind of “text.” We analyze in some detail the properties of the sequences that cause classical algorithms to fail. We then present a lossless algorithm, biocompress-2, that compresses the information contained in DNA and RNA sequences by detecting regularities such as the presence of palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, achieves the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences.
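
To make the notion of a palindrome in a genetic sequence concrete, the following minimal Python sketch searches for a factor whose reverse complement has already occurred earlier in the sequence. It is an illustration only, not the biocompress-2 implementation: the function names, the `min_len` threshold, and the brute-force search are all assumptions introduced here, and the statistical (arithmetic coding) component mentioned in the abstract is not shown.

```python
# Illustrative sketch only; names and parameters are hypothetical,
# not taken from biocompress-2.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def longest_palindrome_match(seq: str, pos: int, min_len: int = 4):
    """Find the longest factor starting at `pos` whose reverse complement
    occurs entirely inside seq[:pos].

    Returns (start, length) of the earlier occurrence, or None when no
    factor of at least `min_len` bases qualifies.
    """
    best = None
    length = min_len
    while pos + length <= len(seq):
        target = reverse_complement(seq[pos:pos + length])
        # str.find with bounds restricts the search to seq[0:pos].
        start = seq.find(target, 0, pos)
        if start == -1:
            # If length L fails, every longer factor fails as well.
            break
        best = (start, length)
        length += 1
    return best
```

A substitutional coder could replace such a factor with a pointer (start, length) to the earlier occurrence whenever the pointer costs fewer bits than the plain encoding of two bits per base; for example, `longest_palindrome_match("GATTACACTGTAAT", 8)` finds that the six bases starting at position 8 are the reverse complement of the factor starting at position 1.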
