On modeling of information retrieval concepts in vector spaces

The Vector Space Model (VSM) has been adopted in information retrieval as a means of coping with the inexact representation of documents and queries and the resulting difficulty of determining the relevance of a document to a given query. The major problem in employing this approach is that the explicit representation of term vectors is not known a priori. Consequently, earlier researchers assumed that the vectors corresponding to terms are pairwise orthogonal, which is clearly unrealistic. Although attempts have been made to compensate for this assumption through separate corrective steps, such methods are ad hoc and, in most cases, formally inconsistent. In this paper, a generalization of the VSM, called the GVSM, is advanced. It provides a solution not only for computing a measure of similarity (correlation) between terms, but also for incorporating these similarities into the retrieval process. The major strength of the GVSM is that it is theoretically sound and elegant. Furthermore, experimental evaluation on several test collections indicates that it performs better than the VSM. Experiments have also been performed on some variations of the GVSM, and all results are compared with those of the VSM based on inverse document frequency weighting. These results and some ideas for the efficient implementation of the GVSM are discussed.
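The effect of the orthogonality assumption can be illustrated with a minimal sketch. It is not the GVSM construction itself, which the abstract does not spell out. It only shows how a term-term correlation matrix, once available, changes a cosine score. The function name `cosine_with_gram`, the three-term vocabulary, and the correlation value 0.8 are all hypothetical and chosen for illustration.

```python
import math


def cosine_with_gram(d, q, gram):
    """Cosine similarity of weight vectors d and q when the underlying
    term vectors have Gram matrix gram (gram[i][j] = t_i . t_j).

    With gram equal to the identity this reduces to the usual VSM cosine,
    i.e. the pairwise-orthogonal assumption.
    """
    def bilinear(x, y):
        # x^T * gram * y, written out without external libraries
        return sum(
            x[i] * gram[i][j] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    denom = math.sqrt(bilinear(d, d) * bilinear(q, q))
    return bilinear(d, q) / denom if denom else 0.0


# Hypothetical vocabulary: ["car", "automobile", "banana"]
ORTHOGONAL = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Assumed correlations: "car" and "automobile" are strongly related.
CORRELATED = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc = [0.0, 1.0, 0.0]    # document indexed only by "automobile"
query = [1.0, 0.0, 0.0]  # query containing only "car"

vsm_score = cosine_with_gram(doc, query, ORTHOGONAL)
gvsm_like_score = cosine_with_gram(doc, query, CORRELATED)
print(vsm_score, gvsm_like_score)
```

Under the orthogonal matrix the document and query share no terms, so they score zero. Under the assumed correlations the same pair receives a positive score, which is the kind of behavior that incorporating term similarities into retrieval is meant to achieve.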
