WebSim: A Web-based Semantic Similarity Measure

Semantic similarity measures are important for numerous tasks in natural language processing, such as word sense disambiguation, automatic synonym extraction, language modelling, and document clustering. We propose a method to measure the semantic similarity between two words using information available on the Web. We extract page counts and snippets for the AND query of the two words from a Web search engine. We define numerous similarity scores based on page counts and lexico-syntactic patterns. These similarity scores are integrated using support vector machines to form a robust semantic similarity measure. The proposed method outperforms all existing Web-based semantic similarity measures on the Miller-Charles benchmark dataset, achieving a high correlation coefficient of 0.834 with human ratings.
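
As an illustration of how page counts can be turned into similarity scores, the following is a minimal sketch. The abstract does not name the specific scores, so the four co-occurrence measures used here (Jaccard, Dice, overlap, and pointwise mutual information) are assumptions chosen because they are standard, and the function names, the argument names, and the index size `n` are hypothetical. Here `p` and `q` are the page counts for each word alone and `pq` is the page count for the AND query of the two words. In a pipeline of the kind described, a vector like this, together with pattern-based features from the snippets, could serve as input to a support vector machine.

```python
import math


def jaccard(p, q, pq):
    """Jaccard coefficient: pages containing both words over pages containing either."""
    union = p + q - pq
    return pq / union if union > 0 else 0.0


def dice(p, q, pq):
    """Dice coefficient: twice the joint count over the sum of the individual counts."""
    total = p + q
    return 2 * pq / total if total > 0 else 0.0


def overlap(p, q, pq):
    """Overlap coefficient: joint count over the smaller individual count."""
    smaller = min(p, q)
    return pq / smaller if smaller > 0 else 0.0


def pmi(p, q, pq, n):
    """Pointwise mutual information (base 2), treating counts / n as probabilities.

    n is an assumed number of pages indexed by the search engine.
    Returns 0.0 when any count is zero, to avoid log(0) and division by zero.
    """
    if min(p, q, pq) <= 0 or n <= 0:
        return 0.0
    return math.log2((pq / n) / ((p / n) * (q / n)))


def page_count_features(p, q, pq, n):
    """Collect the four scores into one feature vector for a word pair."""
    return [jaccard(p, q, pq), dice(p, q, pq), overlap(p, q, pq), pmi(p, q, pq, n)]


if __name__ == "__main__":
    # Made-up counts for illustration only; real values would come from a search engine.
    print(page_count_features(p=100, q=50, pq=25, n=1000))
```

Returning 0.0 for zero counts is a design choice of this sketch: it keeps every feature finite, so a word pair that never co-occurs still yields a usable feature vector.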