Clustering by compression

How do we measure similarity in data of arbitrary type, for example to determine an evolutionary distance or to detect clusters? We develop a general mathematical theory of universal similarity. We tested it on real-world applications in a wide range of fields: the first completely automatic construction of the phylogeny tree based on whole mitochondrial genomes [5, 6]; a completely automatic construction of a language tree for over 50 Euro-Asian languages [6] (for a related independent ad hoc approach see [1]); music classification and clustering [4]; and detection of computer program plagiarism.
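
As a rough illustration of how a compression-based similarity measure can be computed, the sketch below uses the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. This formula is the usual one in the compression-clustering literature; it is not stated in the text above. The use of zlib as the compressor and the function names compressed_size and ncd are assumptions made for this example only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their structure;
    values near 1 suggest they share little. With a real compressor
    the result is only approximate and can slightly exceed 1.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"colorless green ideas sleep furiously tonight " * 20
    print("ncd(a, a) =", ncd(a, a))
    print("ncd(a, b) =", ncd(a, b))
```

A matrix of such pairwise distances can then be handed to any standard clustering or tree-building method. Results depend on the compressor: zlib has a limited match window, so inputs much longer than that window may call for a different compressor.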

[1]  J. Kruskal. Nonmetric multidimensional scaling: A numerical method, 1964.

[2]  N. Saitou, et al. The neighbor-joining method: a new method for reconstructing phylogenetic trees, 1987, Molecular biology and evolution.

[3]  G.G. Langdon, et al. Data compression, 1988, IEEE Potentials.

[4]  Thomas M. Cover, et al. Elements of Information Theory, 2005.

[5]  Peter Yianilos, et al. Normalized Forms for Two Common Metrics, 1991.

[6]  Ming Li, et al. An Introduction to Kolmogorov Complexity and Its Applications, 2019, Texts in Computer Science.

[7]  Anil K. Jain, et al. Feature extraction methods for character recognition: A survey, 1996, Pattern Recognit.

[8]  David S. Watson, et al. A Machine Learning Approach to Musical Style Recognition, 1997, ICMC.

[9]  S. Pääbo, et al. Conflict Among Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders, 1998, Journal of Molecular Evolution.

[10]  Uzi Vishkin, et al. Communication complexity of document exchange, 1999, SODA '00.

[11]  Tao Jiang, et al. A practical algorithm for recovering the best supported edges of an evolutionary tree (extended abstract), 2000, SODA '00.

[12]  T. Belloni, et al. A model-independent analysis of the variability of GRS 1915+105, 2000.

[13]  Tao Jiang, et al. A Polynomial Time Approximation Scheme for Inferring Evolutionary Trees from Quartet Topologies and Its Application, 2001, SIAM J. Comput.

[14]  Thomas R. Buckley, et al. Marsupials and Eutherians reunited: genetic evidence for the Theria hypothesis of mammalian evolution, 2001, Mammalian Genome.

[15]  Shigeo Abe. Pattern Classification, 2001, Springer London.

[16]  Xin Chen, et al. An information-based sequence distance and its application to whole mitochondrial genome phylogeny, 2001, Bioinform.

[17]  Barry Vercoe, et al. Folk Music Classification Using Hidden Markov Models, 2001.

[18]  Vittorio Loreto, et al. Language trees and zipping, 2002, Physical review letters.

[19]  Philip Ball. Algorithm makes tongue tree, 2002.

[20]  Axel Janke, et al. Phylogenetic Analysis of 18S rRNA and the Mitochondrial Genomes of the Wombat, Vombatus ursinus, and the Spiny Anteater, Tachyglossus aculeatus: Increased Support for the Marsupionta Hypothesis, 2002, Journal of Molecular Evolution.

[21]  Jonathan Foote, et al. Automatic Music Summarization via Similarity Analysis, 2002, ISMIR.

[22]  George Tzanetakis, et al. Musical genre classification of audio signals, 2002, IEEE Trans. Speech Audio Process.

[23]  Anil Kokaram, et al. Classifying Music by Genre Using the Wavelet Packet Transform and a Round-Robin Ensemble, 2002.

[24]  Luiz Eduardo Soares de Oliveira, et al. Automatic Recognition of Handwritten Numerical Strings: A Recognition and Verification Strategy, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[25]  Huey-Wen Yien, et al. Information categorization approach to literary authorship disputes, 2003.

[26]  Alexander Kraskov, et al. Hierarchical Clustering Based on Mutual Information, 2003, ArXiv.

[27]  S. Carroll, et al. Genome-scale approaches to resolving incongruence in molecular phylogenies, 2003, Nature.

[28]  J. A. Comer, et al. A novel coronavirus associated with severe acute respiratory syndrome, 2003, The New England journal of medicine.

[29]  C. Kurtzman, et al. Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora, 2003, FEMS yeast research.

[30]  Vittorio Loreto, et al. Music style and authorship categorization by informative compressors, 2003.

[31]  Xin Chen, et al. Shared information and program plagiarism detection, 2004, IEEE Transactions on Information Theory.

[32]  Bin Ma, et al. The similarity metric, 2001, IEEE Transactions on Information Theory.

[33]  Ronald de Wolf, et al. Algorithmic clustering of music, 2003, Proceedings of the Fourth International Conference on Web Delivering of Music (WEDELMUSIC 2004).

[34]  Eamonn J. Keogh, et al. Towards parameter-free data mining, 2004, KDD.

[35]  Ronald de Wolf, et al. Algorithmic Clustering of Music Based on String Compression, 2004, Computer Music Journal.