Understanding and Enhancing the Folding-In Method in Latent Semantic Indexing

Latent Semantic Indexing (LSI) has proven effective at capturing the semantic structure of document collections and is widely used in content-based text retrieval. However, in many real-world applications that deal with very large document collections, LSI suffers from the high computational cost of Singular Value Decomposition (SVD). As a result, the folding-in method is widely used in practice as an approximation to LSI. It is generally implemented as is, without detailed analysis of its effectiveness and performance, so its performance cannot be guaranteed. In this paper, we first illustrate the underlying principle of the folding-in method from a linear algebra point of view and analyze some commonly used existing techniques. Based on this theoretical analysis, we propose a novel algorithm to guide the implementation of the folding-in method. Our method is justified and evaluated by a series of experiments on various classical IR data sets. The results indicate that our method is effective and performs consistently across different document collections.
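For readers unfamiliar with folding-in, the following is a minimal sketch of the standard textbook formulation, not the algorithm proposed in this paper. It assumes a term-document matrix A with rank-k truncated SVD A_k = U_k Σ_k V_k^T, and projects a document vector d into the existing k-dimensional space as Σ_k^{-1} U_k^T d without recomputing the SVD. The toy matrix and the function names `truncated_svd` and `fold_in_document` are illustrative assumptions.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k factors U_k, s_k, Vt_k of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_document(d, U_k, s_k):
    """Project a term-space document vector d into the k-dimensional
    LSI space: d_hat = inv(Sigma_k) @ U_k.T @ d.
    Assumes all retained singular values are nonzero."""
    return (U_k.T @ d) / s_k

# Hypothetical toy term-document matrix: 5 terms (rows) x 4 documents (columns).
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)

# Sanity check: folding in a document that is already in the collection
# reproduces its existing coordinates (the matching column of Vt_k).
existing_hat = fold_in_document(A[:, 0], U_k, s_k)

# Fold in a new, unseen document; the SVD factors are left unchanged.
new_doc = np.array([1, 0, 1, 0, 0], dtype=float)
new_hat = fold_in_document(new_doc, U_k, s_k)
```

Because U_k and Σ_k are never updated, the folded-in representation is only an approximation of what a full recomputation of the SVD would give, which is the trade-off the abstract refers to.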
