Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA

The problem of developing a similarity index for different objects is discussed, and the limitations of current metrics are evaluated. The normalized compression distance, based on the non-computable Kolmogorov complexity, is examined and compared with two alternative measures. As a case study, a phylogenetic tree of different mammals is constructed by applying this technique to a mitochondrial DNA database.
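Because the Kolmogorov complexity is non-computable, the normalized compression distance is evaluated in practice by substituting the output length of a real compressor. The following is a minimal sketch of that idea, not the implementation used in the paper: it assumes the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and it uses Python's zlib as a stand-in compressor C. The helper names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate C(data) by the length of the zlib-compressed bytes.

    Assumption: zlib stands in for the compressor; the paper's choice
    of compressor is not stated in this abstract.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where xy is the concatenation of x and y.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that the two sequences share most of their information, while values near 1 suggest they share little. With a real compressor the result is only an approximation, so it can slightly exceed 1 and NCD(x, x) is usually small rather than exactly 0. A matrix of pairwise values computed this way could then be passed to a hierarchical clustering method to build a tree.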
