On the Use of Singular Value Decomposition for Text Retrieval

The use of the Singular Value Decomposition (SVD) has been proposed for text retrieval in several recent works. This technique uses the SVD to project very high-dimensional document and query vectors into a low-dimensional space. In this new space, it is hoped that the underlying structure of the collection is revealed, thus enhancing retrieval performance. Theoretical results have provided some evidence for this claim, and experiments have confirmed it to some extent. However, these studies have mostly used small test collections and simplified document models. In this work we investigate the use of the SVD on large document collections. We show that, if interpreted as a mechanism for representing the terms of the collection, this technique alone is insufficient for dealing with the variability in term occurrence. Section 2 introduces the text retrieval concepts necessary for our work. Section 3 gives a short description of our experimental architecture. Section 4 describes how term occurrence variability affects the SVD and then shows how the decomposition influences retrieval performance. Section 5 presents a possible way of improving SVD-based techniques, and Section 6 concludes.
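The projection described above can be illustrated with a minimal sketch. It is not the experimental architecture of this work: the function names (lsi_project, project_query, cosine_scores), the five-term toy matrix, and the choice k = 2 are all assumptions made for illustration, and a dense NumPy SVD is used where a large collection would need a sparse, truncated solver. Documents and the query are both mapped with the transpose of the leading left singular vectors; some formulations additionally rescale by the inverse singular values.

```python
import numpy as np

def lsi_project(A, k):
    """Rank-k SVD of a term-by-document matrix A (rows: terms, cols: docs).

    Returns the k leading left singular vectors, the k leading singular
    values, and the documents expressed in the k-dimensional space.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    docs_k = (sk[:, None] * Vt[:k, :]).T  # shape (n_docs, k); equals A.T @ Uk
    return Uk, sk, docs_k

def project_query(q, Uk):
    """Map a term-space query vector into the same k-dimensional space."""
    return Uk.T @ q

def cosine_scores(docs_k, q_k):
    """Cosine similarity between each projected document and the query."""
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    norms = np.where(norms == 0, 1.0, norms)  # guard against division by zero
    return (docs_k @ q_k) / norms

# Hypothetical toy collection. Terms: car, automobile, engine, flower, petal.
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)

Uk, sk, docs_k = lsi_project(A, k=2)
q = np.array([1, 0, 0, 0, 0], dtype=float)  # query containing only "car"
scores = cosine_scores(docs_k, project_query(q, Uk))
```

In this toy matrix, document 1 contains "automobile" and "engine" but not "car", so its cosine with the query is zero in the original term space; after projection it scores as highly as document 0, because both load on the same latent direction. This is the intended effect on a small, clean example and says nothing about behaviour on large collections, which is the question this work examines.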
