Normalized Information Distance

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and is therefore uncomputable, but it can be approximated in practice. First, if the objects have a string representation, compression algorithms can approximate the Kolmogorov complexity, yielding the normalized compression distance (NCD). Second, for names and abstract concepts, page count statistics from the World Wide Web can be used instead, yielding the normalized Google distance (NGD). These practical realizations of the normalized information distance can be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations, and presents numerous successful real-world applications, ranging from bioinformatics to music clustering to machine translation.
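For concreteness, the three distances mentioned above have compact standard definitions in the literature (here K denotes Kolmogorov complexity, C(x) the compressed length of x under a real-world compressor, f(x) the number of web pages containing the search term x, and N the total number of pages indexed):

```latex
% Normalized information distance (ideal, uncomputable)
\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}

% Normalized compression distance (practical, compression-based)
\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}

% Normalized Google distance (practical, page-count-based)
\mathrm{NGD}(x,y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x),\, \log f(y)\}}
```

As a minimal illustration of the compression-based realization, the following sketch computes the NCD of two byte strings. The function name ncd and the choice of zlib are illustrative assumptions rather than the chapter's setup; practical applications often use stronger compressors such as bzip2 or PPM variants.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: approximates the (uncomputable)
    normalized information distance by substituting compressed length
    for Kolmogorov complexity. zlib is an illustrative choice; any
    real-world compressor can be plugged in."""
    cx = len(zlib.compress(x, 9))        # C(x)
    cy = len(zlib.compress(y, 9))        # C(y)
    cxy = len(zlib.compress(x + y, 9))   # C(xy), the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar inputs give a distance near 0; unrelated inputs give a value near 1.
print(ncd(b"the quick brown fox " * 50, b"the quick brown fox " * 45))
print(ncd(b"the quick brown fox " * 50, b"lorem ipsum dolor sit " * 50))
```

The resulting values lie roughly between 0 and 1 (slightly above 1 is possible because real compressors are imperfect), and the pairwise distance matrix they produce can be fed directly into standard clustering methods.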
