A semidiscrete matrix decomposition for latent semantic indexing information retrieval

The vast amount of textual information available today is useless unless it can be effectively and efficiently searched. The goal in information retrieval is to find documents that are relevant to a given user query. We can represent a document collection by a matrix whose (i, j) entry is nonzero only if the ith term appears in the jth document; thus each document corresponds to a column vector. The query is also represented as a column vector whose ith entry is nonzero only if the ith term appears in the query. We score each document for relevancy by taking its inner product with the query. The highest-scoring documents are considered the most relevant. Unfortunately, this method does not necessarily retrieve all relevant documents because it is based on literal term matching. Latent semantic indexing (LSI) replaces the document matrix with an approximation generated by the truncated singular-value decomposition (SVD). This method has been shown to overcome many difficulties associated with literal term matching. In this article we propose replacing the SVD with the semidiscrete decomposition (SDD). We will describe the SDD approximation, show how to compute it, and compare the SDD-based LSI method to the SVD-based LSI method. We will show that SDD-based LSI does as well as SVD-based LSI in terms of document retrieval while requiring only one-twentieth the storage and one-half the time to compute each query. We will also show how to update the SDD approximation when documents are added to or deleted from the document collection.
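
The inner-product scoring described above can be illustrated with a minimal sketch. The three toy documents, the query, and the helper names (`to_vector`, `score`) are hypothetical and not taken from the article, and raw term counts are assumed as the nonzero entries; the article's actual term weighting may differ.

```python
# Toy collection: hypothetical documents chosen for illustration only.
docs = [
    "car engine repair",
    "automobile engine maintenance",
    "banana bread recipe",
]

# Vocabulary: one matrix row per distinct term.
terms = sorted({t for d in docs for t in d.split()})
row = {t: i for i, t in enumerate(terms)}

def to_vector(text):
    """Return a vector whose ith entry counts occurrences of term i in text."""
    v = [0] * len(terms)
    for t in text.split():
        if t in row:  # terms outside the vocabulary are ignored
            v[row[t]] += 1
    return v

# Term-document matrix stored column by column (one column per document).
# Assumption: raw counts are used as the nonzero weights.
columns = [to_vector(d) for d in docs]

def score(query):
    """Score every document by its inner product with the query vector."""
    q = to_vector(query)
    return [sum(qi * ai for qi, ai in zip(q, col)) for col in columns]

scores = score("car repair")
# Document indices ordered from highest to lowest score.
ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

In this toy collection the second document, which uses "automobile" rather than "car", scores zero for the query "car repair" even though it is plausibly relevant. This is the limitation of literal term matching that LSI is meant to address.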
