Compression and Sequence Comparison

A novel approach to measuring sequence complexity and to objectively quantifying the similarity of two sequences is presented. General genetic sequence complexity can be evaluated using sequence compression, for which we design specific algorithms. A compression algorithm is an objective criterion to determine whether something is random or structured [5]. Every compression algorithm exploits some precise text regularities, encodes them economically, and then outputs a possibly shortened sequence. The fact that the resulting description is shorter than the original text proves that the algorithm, in choosing those regularities, reveals a structure. This is the most interesting property of compression. In our work, we do not consider compression as a storage-saving process, but as a tool to exhibit the inner structure of a text.

Consider a compressor which only deals with direct repeats. It first looks inside the input sequence for the repeats that can be coded, replaces the second occurrence of each repeat by a pointer to the first, and thus performs compression. But it can also compress the input sequence with respect to another sequence, and then measure, by way of the compression rate, what proportion of the input sequence is significantly made of pieces of the reference one. This second use of DNA compression yields an objective and clearly understandable measurement of sequence similarity (see [6]).

In the first part, we outline the basic theoretical concepts of compression from Algorithmic Information Theory. Part two shows practical attempts at dedicated sequence compression algorithms and their results.

The underlying concepts of Algorithmic Information Theory, worked out in the 1960s by A. Kolmogorov and G. Chaitin, warrant the choice of text compression as a general investigation tool for sequence analysis. The core question of this theory is: what is the essential information contained in an object?
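The direct-repeat compressor described above can be sketched as follows. This is our own minimal illustration, not the paper's algorithm: a greedy left-to-right pass that either emits a literal symbol or, when a sufficiently long earlier occurrence exists, a pointer (position, length) to the first occurrence. The `min_len` threshold is an assumption; a real coder would emit a pointer only when it is shorter than the literals it replaces.

```python
def compress_repeats(seq, min_len=4):
    """Greedy sketch of a direct-repeat compressor: the second occurrence of
    a repeated factor is replaced by a pointer to the first occurrence."""
    out = []  # tokens: ('lit', symbol) or ('ptr', start, length)
    i = 0
    while i < len(seq):
        best_start, best_len = -1, 0
        # look for the longest earlier (non-overlapping) occurrence of the
        # factor starting at position i
        for j in range(i):
            l = 0
            while i + l < len(seq) and seq[j + l] == seq[i + l] and j + l < i:
                l += 1
            if l > best_len:
                best_start, best_len = j, l
        if best_len >= min_len:
            out.append(('ptr', best_start, best_len))
            i += best_len
        else:
            out.append(('lit', seq[i]))
            i += 1
    return out

def decompress(tokens):
    """Rebuild the sequence: literals are copied, pointers re-read the
    already-decoded prefix."""
    s = ''
    for t in tokens:
        if t[0] == 'lit':
            s += t[1]
        else:
            _, start, length = t
            s += s[start:start + length]
    return s
```

For example, `compress_repeats("ATCGATCGATCG")` emits four literals followed by two pointers back to the initial `ATCG` block; the fewer tokens needed, the more repeat structure the sequence contains.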
Given that any object can be described by a text over {0, 1}, the authors define the Kolmogorov Complexity of a text, denoted by K, as the length of the smallest program able to generate it [5, 3]. This idea strongly ties together the understanding of a text and the compression of its representation: the most compact representation is its shortest "explanation". It can be proved that Kolmogorov Complexity does not depend (up to an additive constant) on the programming language used for the smallest program. Hence K, which bounds possible compression, is an absolute measure. A compression algorithm cannot take advantage …
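Although K itself is uncomputable, any concrete compressor gives a computable upper bound on it, which is what makes compression usable as a practical complexity measure. The following sketch, our own illustration using the general-purpose `zlib` compressor rather than the dedicated algorithms discussed in this paper, shows the two uses outlined above: the compressed length as a complexity estimate, and the extra cost C(y + x) − C(y) as a rough stand-in for compressing x with respect to a reference y.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length of data: a computable upper bound on K(data),
    up to constants that depend on the compressor."""
    return len(zlib.compress(data, 9))

# A structured (repetitive) sequence compresses far below its length,
# while a random sequence over {A, C, G, T} barely compresses at all.
structured = b"ATCG" * 250                                    # 1000 symbols
random.seed(0)
random_seq = bytes(random.choice(b"ACGT") for _ in range(1000))
assert c(structured) < c(random_seq)

def relative_cost(x: bytes, y: bytes) -> int:
    """Extra cost of describing x once the reference y is known:
    a crude approximation of compressing x with respect to y."""
    return c(y + x) - c(y)

# A sequence made of pieces of the reference costs much less relative
# to that reference than to an unrelated one.
x = b"ATCG" * 50
assert relative_cost(x, structured) < relative_cost(x, random_seq)
```

The seed, lengths, and the use of `zlib` are arbitrary choices for the demonstration; a DNA-specific compressor exploiting biological regularities (direct repeats, palindromes) would tighten both bounds.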