A guaranteed compression scheme for repetitive DNA sequences

We present a text compression scheme dedicated to DNA sequences. The exponential growth in the number of sequences creates a real need for analysis tools. In particular, methods are needed that classify sequences by various criteria, one of which is repetitiveness. A good lossless compression scheme can distinguish between "random" and "significant" repeats. The theoretical basis for this statement lies in Kolmogorov complexity theory.
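
The link between compressibility and repetitiveness can be illustrated with a rough sketch. This is not the scheme presented in the paper: it assumes Python's general-purpose `zlib` compressor as a stand-in, and both sequences, as well as the helper name `compression_ratio`, are made up for the illustration.

```python
import random
import zlib


def compression_ratio(sequence: str) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


# Synthetic data, for illustration only (not taken from the paper).
rng = random.Random(0)  # fixed seed so the run is reproducible
random_seq = "".join(rng.choice("ACGT") for _ in range(4000))
repetitive_seq = "ACGTTGCA" * 500  # same length, one motif repeated in tandem

ratio_random = compression_ratio(random_seq)
ratio_repetitive = compression_ratio(repetitive_seq)

# With four equiprobable letters stored as 8-bit ASCII, a plain 2-bit-per-base
# code already reaches a ratio of 0.25, so a ratio near 0.25 signals no
# exploitable repeats, while a much smaller ratio signals strong repetitiveness.
print(f"random:     {ratio_random:.3f}")
print(f"repetitive: {ratio_repetitive:.3f}")
```

Under these assumptions the tandem-repeat sequence compresses far below the 2-bit-per-base baseline while the random one does not, which is the kind of signal a compression-based repetitiveness criterion relies on.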
