Using Web-Search Results to Measure Word-Group Similarity

Semantic relatedness between words is important to many NLP tasks, and numerous measures exist that draw on a variety of resources. Thus far, such work has been confined to measuring similarity between two words (or two texts), and only a handful of measures use the web as a corpus. This paper introduces a distributional similarity measure that uses internet search counts and extends to calculating similarity within word-groups. The evaluation results are encouraging: for word-pairs, the correlations with human judgments are comparable with those of state-of-the-art web-search page-count heuristics. When the measure is used to assess similarity within sets of 10 words, the results correlate highly (up to 0.8) with those expected. Relatively little comparison has been made between the results of different search engines. Here, we compare experimental results from Google, Windows Live Search and Yahoo and find noticeable differences.
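
As a rough illustration of what a page-count heuristic looks like and how a pairwise score might be lifted to a word group, the sketch below computes a Jaccard coefficient from search hit counts and averages it over all pairs in a set. This is not the measure introduced in this paper, which the abstract does not define. The function names, the averaging scheme, and all counts are hypothetical and chosen only for illustration; in practice the counts would come from search-engine queries.

```python
from itertools import combinations


def web_jaccard(count_a, count_b, count_ab):
    """Jaccard coefficient from page counts: hits(a AND b) / hits(a OR b)."""
    union = count_a + count_b - count_ab
    if union <= 0:
        return 0.0
    return count_ab / union


def group_similarity(words, single_counts, pair_counts):
    """Mean pairwise Jaccard score over all unordered pairs in the group.

    Assumption: averaging pairwise scores is one simple way to score a
    group; it is not necessarily how the paper's measure works.
    """
    pairs = list(combinations(sorted(words), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        # Pair counts are keyed by alphabetically sorted word tuples.
        joint = pair_counts.get((a, b), 0)
        total += web_jaccard(single_counts[a], single_counts[b], joint)
    return total / len(pairs)


# Made-up hit counts, for illustration only.
single_counts = {"car": 1000, "automobile": 400, "banana": 800}
pair_counts = {
    ("automobile", "car"): 200,
    ("banana", "car"): 50,
    ("automobile", "banana"): 10,
}

score = group_similarity(["car", "automobile", "banana"], single_counts, pair_counts)
print(f"group similarity: {score:.4f}")
```

With these invented counts, the related pair (car, automobile) scores far higher than either pair involving banana, so a group containing an unrelated word receives a lower mean score.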
