Normalized Compression Distance of Multiples

Normalized compression distance (NCD) is a parameter-free similarity measure based on compression. The NCD between pairs of objects is not sufficient for all applications. We propose an NCD of finite multisets (multiples) of objects that is a metric and is better suited to many applications. Previous attempts to obtain such an NCD failed. We use the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated from above by the length of the compressed version of the file involved, using a real-world compression program. We applied the new NCD for multiples to retinal progenitor cell questions that were earlier treated with the pairwise NCD, and obtained significantly better results. We also applied the NCD for multiples to synthetic time sequence data. The preliminary results are as good as those of a nearest-neighbor Euclidean classifier.

Index Terms: Normalized compression distance, multisets or multiples, pattern recognition, data mining, similarity, Kolmogorov complexity, retinal progenitor cell classification, synthetic data classification
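To make the compression-based approximation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes zlib as the real-world compressor and uses the standard pairwise formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is compressed length. The function `ncd1_multiset` is a hypothetical illustration of one way the numerator and denominator might be generalized to a multiset; the abstract does not state the exact definition, so that form is an assumption. The sample inputs `text` and `noise` are also made up for illustration.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes: an upper-bound stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd_pair(x: bytes, y: bytes) -> float:
    """Standard pairwise NCD using concatenation as the joint object."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd1_multiset(items: list) -> float:
    """ASSUMED multiset generalization (not taken from the abstract):
    (C(X) - min over x of C(x)) / (max over x of C(X without x)).
    For two items this reduces to ncd_pair."""
    if len(items) < 2:
        raise ValueError("need at least two objects")
    whole = clen(b"".join(items))
    smallest = min(clen(x) for x in items)
    largest_rest = max(
        clen(b"".join(items[:i] + items[i + 1:])) for i in range(len(items))
    )
    return (whole - smallest) / largest_rest


# Hypothetical sample data: a highly regular string and pseudo-random bytes.
text = b"retinal progenitor cell " * 100
noise = bytes(random.Random(0).getrandbits(8) for _ in range(2400))

print("NCD(text, text): ", ncd_pair(text, text))
print("NCD(text, noise):", ncd_pair(text, noise))
print("multiset term for [text, text, noise]:", ncd1_multiset([text, text, noise]))
```

Because a real compressor only approximates Kolmogorov complexity from above, the values are not exact: the distance of an object to itself is small but typically not zero, and values slightly above 1 can occur.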
