Unsupervised Semantic Similarity Computation between Terms Using Web Documents

In this work, Web-based metrics that compute the semantic similarity between words or terms are presented and compared with the state of the art. Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant Web documents are downloaded via a Web search engine and the contextual information of words of interest is compared (context-based similarity metrics). The proposed algorithms work automatically, do not require any human-annotated knowledge resources, e.g., ontologies, and can be generalized and applied to different languages. Context-based metrics are evaluated both on the Charles-Miller data set and on a medical term data set. It is shown that context-based similarity metrics significantly outperform co-occurrence-based metrics, in terms of correlation with human judgment, for both tasks. In addition, the proposed unsupervised context-based similarity computation algorithms are shown to be competitive with the state-of-the-art supervised semantic similarity algorithms that employ language-specific knowledge resources. Specifically, context-based metrics achieve correlation scores of up to 0.88 and 0.74 for the Charles-Miller and medical data sets, respectively. The effect of stop word filtering is also investigated for word and term similarity computation. Finally, the performance of context-based term similarity metrics is evaluated as a function of the number of Web documents used and for various feature weighting schemes.
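To make the idea of a context-based similarity metric concrete, the following is a minimal sketch, not the algorithm evaluated in this work. It assumes the documents have already been retrieved and are available as plain strings. It also assumes whitespace tokenization, a symmetric context window, raw frequency weighting, and cosine similarity between context vectors. The function names (`context_vector`, `cosine`, `context_similarity`) and the parameters `window` and `stop_words` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def context_vector(docs, word, window=1, stop_words=frozenset()):
    """Count the tokens that occur within `window` positions of `word`.

    Assumption: whitespace tokenization and lowercasing; stop words are
    removed before the window is applied.
    """
    counts = Counter()
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Collect left and right neighbours, skipping the word itself.
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a word with no observed context is treated as dissimilar
    return dot / (norm_a * norm_b)


def context_similarity(docs, w1, w2, window=1, stop_words=frozenset()):
    """Similarity of two words, computed from the contexts they appear in."""
    v1 = context_vector(docs, w1, window, stop_words)
    v2 = context_vector(docs, w2, window, stop_words)
    return cosine(v1, v2)


if __name__ == "__main__":
    # Toy stand-in for documents returned by a Web search engine.
    docs = [
        "the car drives fast",
        "the automobile drives fast",
        "a banana tastes sweet",
    ]
    print(context_similarity(docs, "car", "automobile"))
    print(context_similarity(docs, "car", "banana"))
```

In this toy corpus, "car" and "automobile" share identical neighbours and score close to 1, while "car" and "banana" share none and score 0. Passing a non-empty `stop_words` set or a different `window` is one way to explore, at small scale, the kind of stop word filtering and feature choices mentioned in the abstract.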
