A semantic similarity metric combining features and intrinsic information content

In many research fields, such as Psychology, Linguistics, Cognitive Science and Artificial Intelligence, computing semantic similarity between words is an important issue. This paper presents a new semantic similarity metric that exploits some notions of the feature-based theory of similarity and translates them into the information-theoretic domain, which leverages the notion of Information Content (IC). In particular, the proposed metric exploits the notion of intrinsic IC, which quantifies IC values by scrutinizing how concepts are arranged in an ontological structure. To evaluate this metric, an online experiment was conducted asking the community of researchers to rank a list of 65 word pairs. The experiment's web setup made it possible to collect 101 similarity ratings and to differentiate native from non-native English speakers. Such a large and diverse dataset makes it possible to evaluate similarity metrics with confidence by correlating them with human assessments. Experimental evaluations using WordNet indicate that the proposed metric, coupled with the notion of intrinsic IC, yields results above the state of the art. Moreover, the intrinsic IC formulation also improves the accuracy of other IC-based metrics. To investigate the generality of both the intrinsic IC formulation and the proposed similarity metric, a further evaluation using the MeSH biomedical ontology was performed. Significant results were obtained in this case as well. The proposed metric and several others have been implemented in the Java WordNet Similarity Library.
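To make the two ideas in the abstract concrete, the following is a minimal sketch of an IC value derived purely from taxonomy structure and of a feature-style similarity built on it. The abstract does not state any formulas, so everything here is an assumption: intrinsic IC uses the hyponym-count formulation commonly attributed to Seco et al., the similarity is a Tversky-style ratio chosen for illustration rather than the paper's own metric, and the toy taxonomy and all function names are hypothetical.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from WordNet or MeSH.
PARENT = {
    "vehicle": "entity",
    "animal": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "dog": "animal",
}
CONCEPTS = set(PARENT) | set(PARENT.values())


def ancestors(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def hyponym_count(concept):
    """Number of concepts located strictly below `concept` in the taxonomy."""
    return sum(1 for d in CONCEPTS if d != concept and concept in ancestors(d))


def intrinsic_ic(concept):
    """Assumed formulation: 1 - log(hypo(c) + 1) / log(total number of concepts).

    Leaves get IC 1 (most informative); the root gets IC 0.
    """
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(len(CONCEPTS))


def most_specific_common_ancestor(c1, c2):
    """First ancestor of c1 (walking upward) that is also an ancestor of c2."""
    ancestors_of_c2 = set(ancestors(c2))
    for candidate in ancestors(c1):
        if candidate in ancestors_of_c2:
            return candidate
    raise ValueError("concepts do not share a root")


def feature_ic_similarity(c1, c2):
    """Illustrative Tversky-style ratio: shared IC over shared plus distinctive IC."""
    if c1 == c2:
        return 1.0
    shared = intrinsic_ic(most_specific_common_ancestor(c1, c2))
    denominator = intrinsic_ic(c1) + intrinsic_ic(c2) - shared
    return shared / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    for pair in [("car", "bicycle"), ("car", "dog"), ("car", "car")]:
        print(pair, round(feature_ic_similarity(*pair), 3))
```

In this sketch the IC of the most specific common ancestor plays the role of the features two concepts share, while the remaining IC of each concept stands in for its distinctive features. No corpus frequencies are needed, which is the point of an intrinsic formulation.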
