A context vector model for information retrieval

In the vector space model for information retrieval, term vectors are pairwise orthogonal; that is, terms are assumed to be independent. It is well known that this assumption is too restrictive. In this article, we present an indexing and retrieval method that builds on the vector space model but incorporates term dependencies, and thus obtains semantically richer representations of documents. First, we generate term context vectors based on the co-occurrence of terms in the same documents. These vectors are then used to calculate context vectors for documents. We present different techniques for estimating the dependencies among terms, and we define term weights that can be employed in the model. Experimental results on four text collections (MED, CRANFIELD, CISI, and CACM) show that incorporating term dependencies into the retrieval process yields statistically significantly better results than the classical vector space model with IDF weights. We also show that the best-performing balance between semantic matching and direct word matching varies across the four collections. We conclude that the model performs well for certain types of queries and, more generally, for information tasks with high recall requirements. Therefore, we propose using the context vector model in combination with other, direct word-matching methods.
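The two indexing steps described above (term context vectors from document-level co-occurrence, then document context vectors built from them) can be illustrated with a minimal sketch. The abstract does not state the formulas used, so everything concrete here is an assumption made for illustration: the dependency estimate (the fraction of documents containing a term that also contain the other term), the raw term-frequency weighting, the toy corpus, and the function names `term_context_vectors`, `context_vector`, and `cosine`.

```python
import math
from collections import Counter


def term_context_vectors(docs):
    """Build one context vector per term from document-level co-occurrence.

    Assumed estimate (not taken from the article): the entry for term u in
    the vector of term t is the fraction of documents containing t that
    also contain u, so the entry for t itself is 1.0.
    """
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    doc_sets = [set(doc) for doc in docs]
    doc_freq = Counter(term for terms in doc_sets for term in terms)

    vectors = {}
    for t in vocab:
        counts = [0.0] * len(vocab)
        for terms in doc_sets:
            if t in terms:
                for u in terms:
                    counts[index[u]] += 1.0
        vectors[t] = [c / doc_freq[t] for c in counts]
    return vocab, vectors


def context_vector(tokens, vocab, term_vectors):
    """Sum the context vectors of the tokens, weighted by term frequency,
    and scale the result to unit length (assumed weighting scheme)."""
    vec = [0.0] * len(vocab)
    for term, tf in Counter(tokens).items():
        if term not in term_vectors:
            continue  # terms unseen at indexing time contribute nothing
        for i, value in enumerate(term_vectors[term]):
            vec[i] += tf * value
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy corpus, already tokenized.
docs = [
    ["heart", "disease", "cardiac"],
    ["cardiac", "surgery"],
    ["wing", "aircraft", "lift"],
]
vocab, term_vecs = term_context_vectors(docs)
doc_vecs = [context_vector(d, vocab, term_vecs) for d in docs]

query_vec = context_vector(["heart"], vocab, term_vecs)
scores = [cosine(query_vec, d) for d in doc_vecs]
print(scores)
```

In this toy setting the query "heart" receives a nonzero score for the second document, which does not contain the word but shares the co-occurring term "cardiac" with the first. This is the kind of semantic matching, as opposed to direct word matching, that the abstract refers to.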
