Information Distance and Its Extensions

Consider, in the most general sense, the space of all information-carrying objects: a book, an article, a name, a definition, a genome, a letter, an image, an email, a webpage, a Google query, an answer, a movie, a music score, a Facebook blog, a short message, or even an abstract concept. Over the past 20 years, we have been developing a general theory of information distance in this space, along with applications of the theory. The theory is object-independent and application-independent. It is also unique, in the sense that no other theory is "better". During the past 10 years, the theory has found many applications. Recently we have introduced two extensions to it, concerning multiple objects and irrelevant information. This expository article focuses on explaining the main ideas behind the theory, especially these recent extensions, and their applications. We also discuss some very preliminary applications.
