Modeling Information Scent: A Comparison of LSA, PMI and GLSA Similarity Measures on Common Tests and Corpora

In this paper we compare three systems that estimate semantic similarity between words: Latent Semantic Analysis (Landauer & Dumais, 1997), Pointwise Mutual Information (Turney, 2001), and Generalized Latent Semantic Analysis (Matveeva, Levow, Farahat, & Royer, 2005). We compare all three techniques on a single corpus (TASA) and, for PMI and GLSA, we also report performance on a larger web-based corpus. The evaluation uses two kinds of tests: (1) synonymy tests, and (2) comparison with human word similarity judgments. The results indicate that for large corpora PMI works best on word similarity tests, and GLSA on synonymy tests. For the smaller TASA corpus, GLSA produced the best performance on most tests. The larger corpus improved the performance of PMI but, in most cases, did not improve that of GLSA.

[1] John B. Goodenough et al. Contextual correlates of synonymy, 1965, CACM.

[2] G. Miller et al. Contextual correlates of semantic similarity, 1991.

[3] George A. Miller et al. WordNet: A Lexical Database for English, 1995, HLT.

[4] Yoshihiko Nitta et al. Co-Occurrence Vectors From Corpora vs. Distance Vectors From Dictionaries, 1994, COLING.

[5] Philip Resnik et al. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, 1995, IJCAI.

[6] T. Landauer et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, 1997.

[7] Ben Shneiderman et al. Readings in information visualization - using vision to think, 1999.

[8] Stuart K. Card et al. The effect of information scent on searching information: visualizations of large tree structures, 2000, AVI '00.

[9] Peter D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL, 2001, ECML.

[10] Hinrich Schütze et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[11] Ganesh S. Oak. Information Visualization Introduction, 2022.

[12] Ehud Rivlin et al. Placing search in context: the concept revisited, 2002, TOIS.

[13] Preslav Nakov et al. Towards Deeper Understanding of the LSA Performance, 2003.

[14] Julie Chen et al. The bloodhound project: automating discovery of web usability issues using the InfoScentπ simulator, 2003, CHI '03.

[15] Stan Szpakowicz et al. Roget's thesaurus and semantic similarity, 2012, RANLP.

[16] Charles L. A. Clarke et al. Frequency Estimates for Statistical Word Similarity Measures, 2003, NAACL.

[17] Marilyn Hughes Blackmon et al. Tool for accurately predicting website navigation problems, non-problems, problem severity, and effectiveness of repairs, 2005, CHI.

[18] Anthony J. Hornof et al. A comparison of LSA, WordNet and PMI-IR for predicting user click behavior, 2005, CHI.

[19] Peter Pirolli et al. Rational Analyses of Information Foraging on the Web, 2005, Cogn. Sci.

[20] Douglas L. T. Rohde et al. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, 2005.

[21] D. Nelson et al. What is preexisting strength? Predicting free association probabilities, similarity ratings, and cued recall probabilities, 2005, Psychonomic Bulletin & Review.

[22] Sriram Raghavan et al. Stanford WebBase components and applications, 2006, TOIT.

[23] Peter Pirolli et al. Navigation in degree of interest trees, 2006, AVI '06.

[24] Gina-Anne Levow et al. Term representation with Generalized Latent Semantic Analysis, 2007.

[25] Patrick F. Reidy. An Introduction to Latent Semantic Analysis, 2009.

[26] Peter Pirolli et al. Information Foraging, 2009, Encyclopedia of Database Systems.