Computation of the Normalized Compression Distance of DNA Sequences using a Mixture of Finite-context Models

A compression-based similarity measure assesses the similarity between two objects by the number of bits needed to describe one of them when a description of the other is available. To be effective, these measures must rely on “normal” compression algorithms, which roughly means that the compressor must be able to build an internal model of the data being compressed. In practice, good “normal” compression methods tend to be slow, whereas fast ones often do not provide acceptable results. In this paper, we propose a method for measuring the similarity of DNA sequences that balances compression quality and speed. The method relies on a mixture of finite-context models and is compared with other methods, including XM, the state-of-the-art DNA compression technique. Moreover, we present a comprehensive study of the inter-chromosomal similarity of the human genome.
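As a rough illustration of the idea, the normalized compression distance is commonly defined as NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C(.) is the number of bits produced by the compressor. The sketch below is not the method proposed in the paper: it assumes a single adaptive order-k finite-context model (rather than a mixture) as a stand-in for the compressor, and the names `fcm_bits`, `ncd`, `order`, and `alpha` are hypothetical choices made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def fcm_bits(seq, order=3, alpha=1.0):
    """Estimated code length (in bits) of seq under one adaptive
    order-`order` finite-context model with additive smoothing `alpha`."""
    # counts[context][symbol] = times `symbol` followed `context` so far
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]  # shorter contexts at the start
        c = counts[ctx]
        total = sum(c.values())
        # Probability estimate from the counts seen so far in this context
        p = (c[sym] + alpha) / (total + alpha * len(ALPHABET))
        bits -= math.log2(p)
        c[sym] += 1  # update the model after "encoding" the symbol
    return bits


def ncd(x, y, order=3):
    """Normalized compression distance with fcm_bits as the compressor."""
    cx, cy = fcm_bits(x, order), fcm_bits(y, order)
    cxy = fcm_bits(x + y, order)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    x = "ACGT" * 50
    y = "AACC" * 50
    print("NCD(x, x) =", ncd(x, x))
    print("NCD(x, y) =", ncd(x, y))
```

Under these assumptions, a sequence compared with itself should give a value well below that of two sequences with unrelated local structure, because the model learned on the first sequence already predicts the second one well.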
