Computational Approaches to Measuring the Similarity of Short Contexts : A Review of Applications and Methods

Measuring the similarity of short written contexts is a fundamental problem in Natural Language Processing. This article provides a unifying framework by which short context problems can be categorized both by their intended application and proposed solution. The goal is to show that various problems and methodologies that appear quite different on the surface are in fact very closely related. The axes by which these categorizations are made include the format of the contexts (headed versus headless), the way in which the contexts are to be measured (first-order versus second-order similarity), and the information used to represent the features in the contexts (micro versus macro views). The unifying thread that binds together many short context applications and methods is the fact that similarity decisions must be made between contexts that share few (if any) words in common.

[1]  Michael E. Lesk,et al.  Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone , 1986, SIGDOC '86.

[2]  Ted Pedersen,et al.  Name Discrimination by Clustering Similar Contexts , 2005, CICLing.

[3]  J. R. Firth,et al.  A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[4]  Ted Pedersen,et al.  Extended Gloss Overlaps as a Measure of Semantic Relatedness , 2003, IJCAI.

[5]  Ted Pedersen,et al.  Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts , 2005, IICAI.

[6]  Hans-Peter Frei,et al.  Concept based query expansion , 1993, SIGIR.

[7]  Ted Pedersen,et al.  Automatic Cluster Stopping with Criterion Functions and the Gap Statistic , 2006, NAACL.

[8]  Hinrich Schütze,et al.  Automatic Word Sense Discrimination , 1998, Comput. Linguistics.

[9]  Esther Levin,et al.  Evaluation of Utility of LSA for Word Sense Discrimination , 2006, HLT-NAACL.

[10]  Padmini Srivasan,et al.  Thesaurus Construction , 1992, Information Retrieval: Data Structures & Algorithms.

[11]  G. Miller,et al.  Contextual correlates of semantic similarity , 1991 .

[12]  Ted Pedersen,et al.  Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts , 2006 .

[13]  Yaacov Choueka,et al.  Disambiguation by short contexts , 1985, Comput. Humanit..

[14]  Padmini Srinivasan,et al.  Thesaurus Construction , 1992, Information Retrieval: Data Structures & Algorithms.

[15]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[16]  Ted Pedersen,et al.  Distinguishing Word Senses in Untagged Text , 1997, EMNLP.

[17]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[18]  Zellig S. Harris,et al.  Distributional Structure , 1954 .

[19]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[20]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[21]  Nancy Ide,et al.  Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art , 1998, Comput. Linguistics.

[22]  Paul Bratley,et al.  Computers and the Humanities , 1978, Computer.

[23]  Carolyn J. Crouch,et al.  Experiments in automatic statistical thesaurus construction , 1992, SIGIR '92.

[24]  Ted Pedersen,et al.  Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces , 2004, CoNLL.

[25]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[26]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[27]  Naglaa Thabet Understanding the Thematic Structure of the Qur'an: An Exploratory Multivariate Approach , 2005, ACL.