Semantic similarity measurement using historical Google search patterns

Computing the semantic similarity between terms (or short text expressions) that have the same meaning but are not lexicographically similar is an important challenge in the information integration field. The problem is that techniques for measuring textual semantic similarity often fail on words not covered by synonym dictionaries. In this paper, we address this problem by determining the semantic similarity of terms using the knowledge inherent in the search history logs of the Google search engine. To do this, we have designed and evaluated four algorithmic methods for measuring the semantic similarity between terms using their associated historical search patterns. These methods are: a) frequent co-occurrence of terms in search patterns, b) computation of the relationship between search patterns, c) outlier coincidence in search patterns, and d) forecasting comparisons. We have shown experimentally that some of these methods correlate well with human judgment when evaluated on general-purpose benchmark datasets, and significantly outperform existing methods when evaluated on datasets containing terms that do not usually appear in dictionaries.
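As a rough illustration of method b), the relationship between two search patterns could be scored with Pearson's correlation coefficient over their search-volume time series. This is a minimal sketch and not the implementation evaluated in the paper: the function name `pearson_similarity` and the weekly volume values are hypothetical and made up for illustration, and the choice of Pearson's coefficient as the relationship measure is an assumption.

```python
from math import sqrt


def pearson_similarity(series_a, series_b):
    """Pearson correlation between two equally long search-volume series.

    Hypothetical helper: treats a high positive correlation between the
    historical search patterns of two terms as evidence of similarity.
    """
    if len(series_a) != len(series_b) or len(series_a) < 2:
        raise ValueError("series must have the same length (at least 2)")

    n = len(series_a)
    mean_a = sum(series_a) / n
    mean_b = sum(series_b) / n

    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(series_a, series_b))
    var_a = sum((a - mean_a) ** 2 for a in series_a)
    var_b = sum((b - mean_b) ** 2 for b in series_b)

    # A constant series carries no pattern; report no correlation.
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


# Made-up weekly search volumes, for illustration only (not real data).
flu = [10, 12, 30, 55, 40, 15, 11, 9]
influenza = [8, 11, 28, 50, 42, 14, 10, 9]
sunscreen = [50, 45, 20, 8, 10, 35, 48, 52]

related = pearson_similarity(flu, influenza)
unrelated = pearson_similarity(flu, sunscreen)
```

With these invented series, `related` comes out close to 1 and `unrelated` is negative, which matches the intuition that terms with the same meaning tend to be searched for at the same times. Correlated search volume alone can also reflect shared seasonality rather than shared meaning, so a score like this is best read as one signal among several.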
