A Web Search Engine-Based Approach to Measure Semantic Similarity between Words

Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page counts-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark data sets showing a high correlation with human ratings. Moreover, the proposed method significantly improves the accuracy in a community mining task.

[1]  G. Miller,et al.  Contextual correlates of semantic similarity , 1991 .

[2]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[3]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.

[4]  Philip E. Gill,et al.  Practical optimization , 1981 .

[5]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[6]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[7]  Thad Hughes,et al.  Lexical Semantic Relatedness with Random Graph Walks , 2007, EMNLP.

[8]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[9]  Eugene Charniak,et al.  Finding Parts in Very Large Corpora , 1999, ACL.

[10]  Jeffrey P. Bigham,et al.  Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge , 2006, AAAI.

[11]  James Allan,et al.  Automatic Query Expansion Using SMART: TREC 3 , 1994, TREC.

[12]  John A. Keane,et al.  Using Web-Search Results to Measure Word-Group Similarity , 2008, COLING.

[13]  Eneko Agirre,et al.  A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches , 2009, NAACL.

[14]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.

[15]  Kôiti Hasida,et al.  POLYPHONET: An advanced social network extraction system from the Web , 2007, J. Web Semant..

[16]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[17]  Ziv Bar-Yossef,et al.  Random sampling from a search engine's index , 2006, WWW '06.

[18]  Graeme Hirst,et al.  Lexical chains as representations of context for the detection and correction of malapropisms , 1995 .

[19]  Roy Rada,et al.  Development and application of a metric on semantic nets , 1989, IEEE Trans. Syst. Man Cybern..

[20]  Zellig S. Harris,et al.  Distributional Structure , 1954 .

[21]  Daniel Jurafsky,et al.  Learning Syntactic Patterns for Automatic Hypernym Discovery , 2004, NIPS.

[22]  James Curran,et al.  Ensemble Methods for Automatic Thesaurus Extraction , 2002, EMNLP.

[23]  Danushka Bollegala,et al.  Measuring semantic similarity between words using web search engines , 2007, WWW '07.

[24]  Danushka Bollegala,et al.  Disambiguating Personal Names on the Web Using Automatically Extracted Key Phrases , 2006, ECAI.

[25]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[26]  David McLean,et al.  An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources , 2003, IEEE Trans. Knowl. Data Eng..

[27]  John B. Goodenough,et al.  Contextual correlates of synonymy , 1965, CACM.

[28]  Kôiti Hasida,et al.  POLYPHONET: an advanced social network extraction system from the web , 2006, WWW '06.

[29]  Eduard H. Hovy,et al.  Learning surface text patterns for a Question Answering System , 2002, ACL.

[30]  Rahul Bhagat,et al.  Large Scale Acquisition of Paraphrases for Learning Surface Patterns , 2008, ACL.

[31]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[32]  Dekang Lin,et al.  Automatic Retrieval and Clustering of Similar Words , 1998, ACL.

[33]  Mario Jarmasz,et al.  Roget's Thesaurus as a Lexical Resource for Natural Language Processing , 2012, ArXiv.

[34]  Boi Faltings,et al.  OSS: A Semantic Similarity Function based on Hierarchical Ontologies , 2007, IJCAI.

[35]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[36]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[37]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[38]  Ehud Rivlin,et al.  Placing search in context: the concept revisited , 2002, TOIS.

[39]  Hsin-Hsi Chen,et al.  Novel Association Measures Using Web Search with Double Checking , 2006, ACL.

[40]  Adam Kilgarriff Googleology is Bad Science , 2007, Computational Linguistics.

[41]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[42]  Simone Paolo Ponzetto,et al.  WikiRelate! Computing Semantic Relatedness Using Wikipedia , 2006, AAAI.

[43]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[44]  Frank Keller,et al.  Using the Web to Obtain Frequencies for Unseen Bigrams , 2003, CL.

[45]  Jianyong Wang,et al.  Mining sequential patterns by pattern-growth: the PrefixSpan approach , 2004, IEEE Transactions on Knowledge and Data Engineering.

[46]  Ronald Rosenfeld,et al.  A maximum entropy approach to adaptive statistical language modelling , 1996, Comput. Speech Lang..