Normalized Compression Distance of Multisets with Applications

Pairwise normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free similarity metric based on compression. We propose an NCD of multisets that is also a metric; previous attempts to obtain such an NCD failed. For classification purposes it is superior to the pairwise NCD in accuracy and implementation complexity. We cover the entire trajectory from theoretical underpinning to feasible practice. The multiset NCD is applied to biological (stem cell, organelle transport) and OCR classification questions that were earlier treated with the pairwise NCD, and with the new method we achieved significantly better results. The theoretical foundation is Kolmogorov complexity.
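For readers unfamiliar with the pairwise NCD that the abstract takes as its baseline, the following is a minimal sketch of the commonly cited pairwise form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the length of the compressed input. The use of zlib as the compressor and the helper names `compressed_len` and `ncd` are assumptions made for illustration; the abstract names no particular compressor, and this sketch does not implement the multiset NCD the paper proposes.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Pairwise normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    repetitive = b"abc" * 200
    varied = bytes(range(256))
    # An object compared with itself should score lower than
    # two objects that share little structure.
    print(ncd(repetitive, repetitive))
    print(ncd(repetitive, varied))
```

With a real compressor the value is only an approximation of the ideal distance, so NCD(x, x) is typically small but not exactly zero.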
