Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis

Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
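The pipeline described above (map each text to a weighted vector of concepts, then compare vectors by cosine) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-entry `CONCEPTS` dictionary stands in for Wikipedia, the TF-IDF-style weighting is one plausible choice that the abstract does not specify, and the function names `build_inverted_index`, `concept_vector`, `cosine`, and `relatedness` are hypothetical.

```python
import math
from collections import Counter

# Toy stand-in for Wikipedia: concept title -> article text.
# These three "articles" are invented purely for illustration.
CONCEPTS = {
    "Computer": "computer software hardware program keyboard mouse",
    "Cat": "cat feline pet mouse whiskers",
    "Bank": "bank money loan account deposit",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each word to {concept: weight}, using a simple TF-IDF variant.

    The weighting scheme is an assumption; the abstract only says that
    texts are represented as weighted vectors of concepts.
    """
    n = len(concepts)
    tf = {name: Counter(tokenize(text)) for name, text in concepts.items()}
    # Document frequency: in how many concept articles each word occurs.
    df = Counter(word for counts in tf.values() for word in counts)
    index = {}
    for name, counts in tf.items():
        for word, count in counts.items():
            idf = math.log(1.0 + n / df[word])
            index.setdefault(word, {})[name] = count * idf
    return index


def concept_vector(text, index):
    """Represent a text as a sparse vector over concepts.

    The vector is the sum of the concept weights of the text's words;
    words absent from the index contribute nothing.
    """
    vec = Counter()
    for word in tokenize(text):
        for name, weight in index.get(word, {}).items():
            vec[name] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(name, 0.0) for name, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


INDEX = build_inverted_index(CONCEPTS)


def relatedness(text_a, text_b, index=INDEX):
    """Relatedness of two texts as the cosine of their concept vectors."""
    return cosine(concept_vector(text_a, index), concept_vector(text_b, index))


if __name__ == "__main__":
    for a, b in [("keyboard", "software"), ("mouse", "keyboard"), ("loan", "whiskers")]:
        print(f"{a!r} vs {b!r}: {relatedness(a, b):.3f}")
```

In this toy setup, "keyboard" and "software" share no surface form yet both map to the same concept, so they score as related, while "mouse" is split between two concepts and scores lower against "keyboard". Because each dimension is a named concept, the vector itself shows which concepts drove a score, which is the sense in which the abstract calls the model easy to explain.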
