An information retrieval model based on vector space method by supervised learning

This paper proposes a method to improve the retrieval performance of the vector space model (VSM), in part by using user-supplied information about which documents are relevant to a given query. In addition to the user's relevance feedback, information such as the original document similarities is incorporated into the retrieval model, which is built from a sequence of linear transformations. The high-dimensional, sparse vectors are then reduced by singular value decomposition (SVD) and transformed into a low-dimensional vector space, namely the space representing the latent semantic meanings of words. The method was tested on two test collections, the Medline collection and the Cranfield collection. To train the model, multiple partitions were created for each collection. Compared with the latent semantic indexing (LSI) model, the improvements in average precision, averaged over all partitions, are 20.57% (Medline) and 22.23% (Cranfield) on the training data and 0.47% (Medline) and 4.78% (Cranfield) on the test data. The proposed method makes it possible to preserve user-supplied relevance information in the system over the long term so that it can be used later.
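
As a rough illustration of the SVD step and of the LSI baseline mentioned above, the following is a minimal sketch of latent-space retrieval on a toy term-by-document matrix. It does not implement the proposed supervised transformations, which the abstract does not specify. The matrix `A`, the query, the choice `k = 2`, and the helper names `lsi_fit` and `cosine_scores` are all assumptions made for this example.

```python
import numpy as np

# Toy term-by-document matrix (rows: 5 terms, columns: 4 documents).
# Illustrative data only: documents 0-1 share one vocabulary, documents 2-3 another.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)


def lsi_fit(term_doc, k):
    """Truncated SVD of a term-by-document matrix.

    Returns the top-k left singular vectors and the documents'
    coordinates in the k-dimensional latent space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    # Equivalent to projecting each document column onto U_k.
    doc_vecs = Vt[:k, :].T * s[:k]
    return U_k, doc_vecs


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return doc_vecs @ query_vec / norms


U_k, doc_vecs = lsi_fit(A, k=2)

# Query containing only term 1, which occurs in document 0 but not in document 1.
query = np.array([0, 1, 0, 0, 0], dtype=float)

raw_scores = cosine_scores(query, A.T)           # plain VSM, term space
lsi_scores = cosine_scores(query @ U_k, doc_vecs)  # same query folded into latent space

print("VSM scores:", np.round(raw_scores, 3))
print("LSI scores:", np.round(lsi_scores, 3))
```

In this toy case the plain VSM gives document 1 a score of zero because it shares no term with the query, while the latent-space scores place documents 0 and 1 together and keep documents 2 and 3 apart. A supervised method of the kind the abstract describes would additionally adjust such a space using relevance judgments.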
