Representing Documents Using an Explicit Model of Their Similarities

A method is proposed for creating vector space representations of documents based on modeling target inter-document similarity values. The target similarity values are assumed to capture semantic relationships, or associations, between the documents. The vector representations are chosen so that the inner product similarities between document vector pairs closely match their target inter-document similarities. The method is closely related to the Latent Semantic Indexing approach; in fact, the two are equivalent when the target similarities are derived directly from document similarities based on term co-occurrence. However, our method allows external sources of inter-document semantic constraints to be used in the indexing, though at greater computational expense. The method is applied to three standard text databases from the information retrieval literature. On the CISI database of information science abstracts, performance (measured by precision averaged over a range of recall levels) improves by 28% compared to a weighted term-vector approach and by 10% compared to Latent Semantic Indexing. Similar improvement is obtained on the Cranfield database, but no improvement is obtained for the artificial MED database of medical abstracts. The generally favorable performance suggests interesting potential for methods that explicitly modify the retrieval system to meet inter-document semantic constraints.
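The central idea, choosing document vectors whose inner products approximate a matrix of target similarities, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the squared-error fit criterion, the plain gradient-descent optimizer, the step size and iteration count, the toy 3-document target matrix, and the helper names `inner`, `fit_error`, and `fit_document_vectors`.

```python
def inner(u, v):
    """Inner product similarity between two document vectors."""
    return sum(a * b for a, b in zip(u, v))


def fit_error(vectors, targets):
    """Sum of squared differences between target and inner-product similarities."""
    n = len(vectors)
    return sum((targets[i][j] - inner(vectors[i], vectors[j])) ** 2
               for i in range(n) for j in range(n))


def fit_document_vectors(targets, start, steps=3000, rate=0.01):
    """Gradient descent on fit_error; assumes `targets` is symmetric."""
    vectors = [list(v) for v in start]
    n, k = len(vectors), len(vectors[0])
    for _ in range(steps):
        # residual[i][j] = target similarity minus current inner product
        residual = [[targets[i][j] - inner(vectors[i], vectors[j])
                     for j in range(n)] for i in range(n)]
        # For symmetric targets, d(error)/d(x_i) = -4 * sum_j residual[i][j] * x_j,
        # so each vector moves along +4 * sum_j residual[i][j] * x_j.
        vectors = [[vectors[i][d] + rate * 4 * sum(residual[i][j] * vectors[j][d]
                                                   for j in range(n))
                    for d in range(k)] for i in range(n)]
    return vectors


# Hypothetical target similarities for three documents (symmetric, unit diagonal).
targets = [[1.0, 0.8, 0.0],
           [0.8, 1.0, 0.6],
           [0.0, 0.6, 1.0]]
# Arbitrary small starting vectors in a 2-dimensional space.
start = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
vectors = fit_document_vectors(targets, start)

print("error before fitting:", fit_error(start, targets))
print("error after fitting: ", fit_error(vectors, targets))
```

In this toy case the targets can be reproduced exactly by 2-dimensional vectors, so the fit error shrinks toward zero. With targets from an external source an exact fit will generally not exist, and the fitted vectors are only a best approximation under the chosen criterion.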
