Analysis and comparison of information theory-based distances for genomic strings

Genomic string comparison via alignment is widely applied for mining and retrieval of information in biological databases. In some situations, the effectiveness of such alignment‐based comparison is still unclear, e.g., for sequences with non‐uniform length and with significant shuffling of identical substrings. An alternative approach is based on information‐theoretic distances. The information content of biological data is stored in very long strings of only four characters. In the last ten years, several entropic measures have been proposed for genomic string analysis. Notwithstanding their individual merit and experimental validation, to the best of our knowledge, there is no direct comparison of these different metrics. We present four of the most representative alignment‐free distance measures based on mutual information. Each one has a different origin and expression. Our comparison involves a rearrangement that reduces the different concepts to a single formalism, so that it has been possible to co...
