Pictures of relevance: a geometric analysis of similarity measures

We want computer systems that can help us assess the similarity or relevance of existing objects (e.g., documents, functions, commands) to a statement of our current needs (e.g., a query). To this end, a variety of similarity measures have been proposed. However, the relationship between a measure's formula and its performance is not always obvious. A geometric analysis is advanced, and its utility is demonstrated by applying it to six conventional information retrieval similarity measures and a seventh, spreading activation measure. All seven measures work with a representational scheme in which a query and the database objects are represented as vectors of term weights. The geometric analysis characterizes each similarity measure by the nature of its iso-similarity contours in an n-space containing the query and object vectors. This analysis reveals important differences among the similarity measures and suggests conditions under which these differences will affect retrieval performance. The cosine coefficient, for example, is shown to be insensitive to between-document differences in the magnitude of term weights, while the inner product measure is sometimes overly affected by such differences. The context-sensitive spreading activation measure may overcome both of these limitations and deserves further study. The geometric analysis is intended to complement, and perhaps to guide, the empirical analysis of similarity measures. © 1987 John Wiley & Sons, Inc.
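The contrast drawn above between the cosine coefficient and the inner product can be illustrated with a minimal sketch. It uses the standard textbook definitions of the two measures, which are assumed here rather than taken from the paper's own notation, and the three-term vectors `query`, `doc_a`, `doc_b`, and `doc_c` are hypothetical values chosen only for illustration.

```python
import math

def inner_product(q, d):
    """Inner product of two term-weight vectors of equal length."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Cosine coefficient: inner product normalized by both vector lengths."""
    norm_q = math.sqrt(inner_product(q, q))
    norm_d = math.sqrt(inner_product(d, d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention assumed here for an all-zero vector
    return inner_product(q, d) / (norm_q * norm_d)

# Hypothetical term-weight vectors over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
doc_a = [1.0, 1.0, 0.0]   # same direction as the query
doc_b = [3.0, 3.0, 0.0]   # same direction, three times the magnitude
doc_c = [5.0, 0.0, 5.0]   # shares only one query term, but with large weights

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, "inner:", inner_product(query, doc),
          "cosine:", round(cosine(query, doc), 3))
```

Under these assumptions the cosine coefficient cannot tell `doc_a` from `doc_b`, since it depends only on the angle to the query, whereas the inner product ranks `doc_c` above `doc_a` purely because of its larger weights. Geometrically, this is consistent with iso-similarity contours that are cones around the query vector for the cosine coefficient and hyperplanes perpendicular to the query vector for the inner product.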
