Design, Implementation and Evaluation of a New Semantic Similarity Metric Combining Features and Intrinsic Information Content

Computing semantic similarity between words is an important problem in many research fields, including Psychology, Linguistics, Cognitive Science, Biomedicine, and Artificial Intelligence. In this paper we present a new semantic similarity metric that takes notions from early work on the feature-based theory of similarity and translates them into the information-theoretic domain, which relies on the notion of Information Content (IC). In particular, the proposed metric exploits intrinsic IC, which derives IC values from how concepts are arranged in an ontological structure. To evaluate the metric, we conducted an online experiment asking the community of researchers to rank a list of 65 word pairs. The experiment's web setup allowed us to collect 101 similarity ratings and to distinguish native from non-native English speakers. Such a large and diverse dataset makes it possible to evaluate similarity metrics with confidence by correlating them with human assessments. Experimental evaluations using WordNet indicate that our metric, coupled with intrinsic IC, yields results above the state of the art. Moreover, the intrinsic IC formulation also improves the accuracy of other IC-based metrics. We implemented our metric and several others in the Java WordNet Similarity Library.
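To make the two ingredients concrete, the following is a minimal Python sketch, not the code of the Java WordNet Similarity Library. It assumes the commonly cited intrinsic IC formulation, IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of concepts below c and N is the total number of concepts. It also assumes one possible feature-style combination in which the IC of the most specific common ancestor stands for shared features and the remaining IC of each concept stands for its distinctive features. The toy taxonomy, the function names, and the equal weighting are all illustrative assumptions that the abstract does not state.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); this is not WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def ancestors(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def hyponym_count(concept):
    """Count the concepts that lie strictly below `concept`."""
    return sum(1 for c in PARENT if c != concept and concept in ancestors(c))


def intrinsic_ic(concept):
    """Intrinsic IC: 1 - log(hypo(c) + 1) / log(N); leaves get 1, the root 0."""
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(len(PARENT))


def most_specific_common_ancestor(c1, c2):
    """Return the deepest concept that subsumes both c1 and c2."""
    ancestors_of_c2 = set(ancestors(c2))
    for candidate in ancestors(c1):  # walks upward from c1
        if candidate in ancestors_of_c2:
            return candidate
    raise ValueError("concepts share no ancestor")


def feature_ic_similarity(c1, c2):
    """Assumed feature-style score: shared IC minus each concept's distinctive IC.

    shared        = IC(msca)
    distinctive_i = IC(c_i) - IC(msca)
    score         = shared - distinctive_1 - distinctive_2
                  = 3 * IC(msca) - IC(c1) - IC(c2)
    """
    if c1 == c2:
        return 1.0
    shared = intrinsic_ic(most_specific_common_ancestor(c1, c2))
    return 3.0 * shared - intrinsic_ic(c1) - intrinsic_ic(c2)


if __name__ == "__main__":
    for pair in [("dog", "cat"), ("dog", "car"), ("dog", "dog")]:
        print(pair, round(feature_ic_similarity(*pair), 3))
```

In this toy taxonomy the pair (dog, cat) scores higher than (dog, car) because its common ancestor is more specific. The raw score for distinct concepts is not confined to [0, 1] under these assumptions, so it would need rescaling before being compared directly with human ratings.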
