Measuring semantic similarity between words using web search engines

Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in semantic web-related applications such as community mining, relation extraction, automatic meta data extraction have used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. We propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We deflne various similarity scores for two given words P and Q, using the page counts for the queries P, Q and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets. These difierent similarity scores are integrated using support vector machines, to leverage a robust semantic similarity measure. Experimental results on Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coe‐cient of 0:834. Moreover, the proposed semantic similarity measure signiflcantly improves the accuracy (F-measure of 0:78) in a community mining task, and in an entity disambiguation task, thereby verifying the capability of the proposed measure to capture semantic similarity using web content.

[1]  Hinrich Schütze,et al.  Automatic Word Sense Discrimination , 1998, Comput. Linguistics.

[2]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[3]  Dekang Lin,et al.  Automatic Retrieval and Clustering of Similar Words , 1998, ACL.

[4]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[5]  Philip E. Gill,et al.  Practical optimization , 1981 .

[6]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[7]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[8]  Stephen J. Wright,et al.  Numerical Optimization , 2018, Fundamental Statistical Inference.

[9]  A. Tversky Features of Similarity , 1977 .

[10]  J. Jenkins,et al.  Word association norms , 1964 .

[11]  Danushka Bollegala,et al.  Disambiguating Personal Names on the Web Using Automatically Extracted Key Phrases , 2006, ECAI.

[12]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[13]  Peter D. Turney Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL , 2001, ECML.

[14]  Andrew McCallum,et al.  Disambiguating Web appearances of people in a social network , 2005, WWW '05.

[15]  Eduard Hovy,et al.  Multi-Document Person Name Resolution , 2004 .

[16]  James Allan,et al.  Automatic Query Expansion Using SMART: TREC 3 , 1994, TREC.

[17]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[18]  Kôiti Hasida,et al.  POLYPHONET: an advanced social network extraction system from the web , 2006, WWW '06.

[19]  Mitsuru Ishizuka,et al.  Graph-based Word Clustering using a Web Search Engine , 2006, EMNLP.

[20]  Kôiti Hasida,et al.  POLYPHONET: An advanced social network extraction system from the Web , 2007, J. Web Semant..

[21]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[22]  Steffen Staab,et al.  Towards the self-annotating web , 2004, WWW '04.

[23]  Frank Keller,et al.  Using the Web to Obtain Frequencies for Unseen Bigrams , 2003, CL.

[24]  Ziv Bar-Yossef,et al.  Random sampling from a search engine's index , 2006, WWW '06.

[25]  Ron Weiss,et al.  Fast and effective query refinement , 1997, SIGIR '97.

[26]  Julie Weeds,et al.  Finding Predominant Word Senses in Untagged Text , 2004, ACL.

[27]  Jeffrey P. Bigham,et al.  Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge , 2006, AAAI.

[28]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[29]  D. Gentner,et al.  Respects for similarity , 1993 .

[30]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.

[31]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[32]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[33]  John B. Goodenough,et al.  Contextual correlates of synonymy , 1965, CACM.

[34]  Hui Han,et al.  Name disambiguation in author citations using a K-way spectral clustering method , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[35]  Chris Buckley,et al.  Improving automatic query expansion , 1998, SIGIR '98.

[36]  Hsin-Hsi Chen,et al.  Novel Association Measures Using Web Search with Double Checking , 2006, ACL.

[37]  G. Miller,et al.  Contextual correlates of semantic similarity , 1991 .

[38]  Ronald Rosenfeld,et al.  A maximum entropy approach to adaptive statistical language modelling , 1996, Comput. Speech Lang..

[39]  Susumu Horiguchi,et al.  Personal Name Resolution Crossover Documents by a Semantics-Based Approach , 2006, IEICE Trans. Inf. Syst..

[40]  David McLean,et al.  An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources , 2003, IEEE Trans. Knowl. Data Eng..

[41]  Roy Rada,et al.  Development and application of a metric on semantic nets , 1989, IEEE Trans. Syst. Man Cybern..

[42]  Peter Mika,et al.  Ontologies are us: A unified model of social networks and semantics , 2005, J. Web Semant..

[43]  Noah A. Smith,et al.  The Web as a Parallel Corpus , 2003, CL.

[44]  James Curran,et al.  Ensemble Methods for Automatic Thesaurus Extraction , 2002, EMNLP.

[45]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[46]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.

[47]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[48]  Mitsuru Ishizuka,et al.  Extracting Keyphrases to Represent Relations in Social Networks from Web , 2007, IJCAI.