The Google Similarity Distance

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "a way to search the database." We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database and Google as the search engine. The method is also applicable to other search engines and databases. We then apply this theory to construct a method that automatically extracts the similarity of words and phrases, the Google similarity distance, from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. Our examples distinguish between colors and numbers, cluster names of paintings by 17th-century Dutch masters and names of books by English novelists, show an ability to understand emergencies and primes, and demonstrate a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification, using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87 percent with the expert-crafted WordNet categories.
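As a rough illustration of how a similarity distance can be computed from page counts alone, the sketch below uses the commonly stated form of the normalized Google distance, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). Here f(x) and f(y) are the page counts for each term, f(x, y) is the count of pages containing both, and N is the number of indexed pages. The abstract itself does not state this formula, so treat it as an assumption. The function name `ngd`, the zero-count handling, and the page counts are illustrative values, not live query results.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized distance from page counts (a sketch, not the authors' code).

    fx, fy -- page counts for each term alone
    fxy    -- page count for pages containing both terms
    n      -- assumed total number of indexed pages
    """
    # Assumption: terms that never occur, or never co-occur, are treated
    # as infinitely far apart.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative counts for two related terms (hypothetical values).
    distance = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
    print(round(distance, 3))  # approximately 0.44
```

Smaller values indicate terms that tend to appear on the same pages. Two terms that always co-occur give a distance of 0.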
