Identification of Critical Values in Latent Semantic Indexing

In this chapter we analyze the values used by Latent Semantic Indexing (LSI) for information retrieval. By manipulating the values in the Singular Value Decomposition (SVD) matrices, we find that a significant fraction of the values have little effect on overall performance and can thus be removed (set to zero). Identifying and removing those entries lets us convert the dense term-by-dimension and document-by-dimension matrices into sparse matrices. We show empirically that these entries are unimportant by presenting retrieval and runtime performance results on seven collections: removing up to 70% of the values in the term-by-dimension matrix yields retrieval performance similar to or better than that of LSI. Removing 90% of the values degrades retrieval performance slightly for the smaller collections, but improves it by 60% on the large collection we tested. Our approach has the additional computational benefit of reducing memory requirements and query response time.
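
The general idea of zeroing entries in the SVD matrices can be illustrated with a minimal NumPy sketch. The abstract does not state how the removable entries are identified, so the magnitude-based rule below (zero the entries with the smallest absolute value) is an assumption made only for illustration, as are the function names `truncated_svd`, `sparsify`, and `lsi_scores`, the toy matrix, and the particular cosine scoring variant.

```python
import numpy as np


def truncated_svd(A, k):
    """Return the rank-k factors (T, s, D) with A ~= T @ diag(s) @ D.T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def sparsify(M, fraction):
    """Zero out `fraction` of the entries of M, smallest magnitudes first.

    Assumption: magnitude is used as the importance criterion here;
    the source text does not specify the selection rule.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_remove = int(fraction * M.size)
    out = M.flatten()  # flatten() always returns a copy
    if n_remove > 0:
        # Indices of the n_remove smallest absolute values (unordered).
        idx = np.argpartition(np.abs(out), n_remove - 1)[:n_remove]
        out[idx] = 0.0
    return out.reshape(M.shape)


def lsi_scores(T, s, D, q):
    """Cosine similarity between a query and each document in k dimensions.

    One common scoring variant (an assumption): the query is projected
    with T, and document coordinates are scaled by the singular values.
    """
    q_hat = q @ T
    docs = D * s
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    return (docs @ q_hat) / np.where(denom == 0.0, 1.0, denom)


if __name__ == "__main__":
    # Toy term-by-document matrix (5 terms, 4 documents), made up for the demo.
    A = np.array([
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [1.0, 0.0, 0.0, 1.0],
    ])
    T, s, D = truncated_svd(A, k=2)
    T_sparse = sparsify(T, 0.7)  # remove 70% of term-by-dimension values
    q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # query containing terms 0 and 1
    print("dense scores :", lsi_scores(T, s, D, q))
    print("sparse scores:", lsi_scores(T_sparse, s, D, q))
```

Comparing the two score vectors on a real collection is what would show whether the removed entries matter; to realize the memory savings, the sparsified matrix would be stored in a sparse format rather than as a dense array with zeros.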
