The similarity metric

We study a new class of distances for measuring similarity between sequences, where each distance in the class captures one type of similarity. We propose a new "normalized information distance," based on the noncomputable notion of Kolmogorov complexity, and show that it belongs to this class and minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To demonstrate generality and robustness, we give two distinct applications in widely divergent areas, using standard compression programs such as gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history, which yields a first completely automatically computed whole mitochondrial phylogeny tree. Second, we fully automatically compute the language tree of 52 different languages.
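
To illustrate how a standard compressor can stand in for the noncomputable Kolmogorov complexity, here is a minimal sketch. It is not the authors' implementation. It assumes the commonly used compression-based approximation (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes and xy is the concatenation of x and y. The abstract does not state this formula. Python's zlib (the DEFLATE algorithm that gzip also uses) replaces the gzip program, and the names compressed_size and compression_distance are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for K(x))."""
    return len(zlib.compress(data, 9))


def compression_distance(x: bytes, y: bytes) -> float:
    """Compressor-based approximation of the normalized information distance.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Values near 0 suggest the inputs are similar; values near 1 suggest
    they share little structure that the compressor can exploit.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    a = b"ACGTACGTTTGACCA" * 40
    b = b"ACGTACGTTTGACGA" * 40
    print(compression_distance(a, b))
```

Because a real compressor only approximates Kolmogorov complexity, the result can be slightly asymmetric and can slightly exceed 1. The metric properties are proved for the idealized distance, not for this approximation.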
