PMING Distance: A Collaborative Semantic Proximity Measure

One of the main problems in the classic approach to semantics is the difficulty of acquiring and maintaining ontologies and semantic annotations. Meanwhile, the flow of data and documents accessible from the Web is continuously fueled by the contributions of millions of users who interact digitally in a collaborative way. Search engines, which continually explore the Web, are therefore a natural source of information on which to base a modern approach to semantic annotation. A promising idea is to generalize semantic similarity, under the assumption that semantically similar terms behave similarly, and to define collaborative proximity measures based on the indexing information returned by search engines. This work introduces PMING, a new collaborative proximity measure that uses the information provided by search engines as a basis for extracting semantic content. PMING is defined by combining the best features of the other state-of-the-art proximity distances that were considered. It defines the degree of relatedness between terms using only the number of documents returned as the result of a query, so the measure dynamically reflects the collaborative changes made to Web resources. Experiments conducted on popular collaborative and generalist engines (e.g. Flickr, YouTube, Google, Bing, Yahoo Search) show that PMING outperforms state-of-the-art proximity measures (e.g. Normalized Google Distance, Flickr Distance) in modeling contexts, modeling human perception, and clustering semantic associations.
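
To illustrate what a proximity measure built only on result counts looks like, here is a minimal sketch of the Normalized Google Distance, one of the baselines named above. It is not PMING, whose formula is not given in this abstract. The function name `ngd`, its parameter names, and the hit counts in the usage lines are hypothetical; in practice the counts would come from search engine queries for each term alone and for both terms together.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance computed from raw hit counts.

    fx, fy : number of documents returned for each term alone
    fxy    : number of documents returned for both terms together
    n      : assumed total number of indexed documents (n > fx, fy)
    """
    # Terms that never occur, or never co-occur, are treated as
    # infinitely distant (a convention chosen for this sketch).
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Illustrative, made-up counts (not measured from any real engine).
always_together = ngd(1_000, 1_000, 1_000, 1_000_000)
rarely_together = ngd(1_000, 1_000, 10, 1_000_000)
print(always_together, rarely_together)
```

Terms that always appear together get a distance of 0, and the distance grows as co-occurrence becomes rarer relative to the individual counts. Because the inputs are live result counts, recomputing the value later reflects whatever users have added to the index in the meantime.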
