Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching can be inaccurate. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of the conceptual topic or meaning of a document. Latent Semantic Indexing (LSI) [Deerwester et al] tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition (SVD) is used to estimate the structure in word usage across documents. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD. Performance data shows that these statistically derived vectors are more robust indicators of meaning than individual terms. Section 2 reviews the basic concepts needed to understand LSI. Section 3 describes some of the advantages and disadvantages of LSI. The effectiveness of LSI has been demonstrated empirically on several text collections as increased average retrieval precision, but a theoretical (and quantitative) understanding beyond the empirical evidence is desirable. Section 4 describes some of the attempts that have been made in this direction. Finally, Section 5 presents some applications of LSI.
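As a rough illustration of the retrieval scheme summarized above, the following minimal sketch builds a truncated SVD of a toy term-document matrix and compares a query against documents in the reduced space. The matrix, the vocabulary, the choice k = 2, and the helper names `lsi_fit` and `lsi_query` are all assumptions made for this example; the query is projected with the commonly used fold-in formula (query vector times U_k, divided by the singular values), which may differ in scaling from any particular LSI implementation.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)


def lsi_fit(A, k):
    """Return the rank-k factors of A: term vectors, singular values, document vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def lsi_query(q, U_k, s_k, docs_k):
    """Fold the query into the k-dimensional space and return its cosine with each document."""
    q_k = (q @ U_k) / s_k
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return (docs_k @ q_k) / norms


U_k, s_k, docs_k = lsi_fit(A, k=2)

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # query containing only the term "car"
lexical = q @ A                           # literal term overlap with each document
latent = lsi_query(q, U_k, s_k, docs_k)   # similarity in the latent space

print("lexical overlap:", lexical)
print("latent cosine:  ", np.round(latent, 3))
```

In this toy collection the second document contains "automobile" and "engine" but not "car", so its lexical overlap with the query is zero. Because "car" and "automobile" both co-occur with "engine", the two documents collapse onto the same latent dimension, and the second document still receives a high cosine score. This is the synonymy effect described above, shown under the stated assumptions only.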
[1] Paul G. Young. Cross-Language Information Retrieval Using Latent Semantic Indexing. 1994.
[2] H. Schütze, et al. Dimensions of meaning. Supercomputing '92, 1992.
[3] Susan T. Dumais, et al. Latent Semantic Indexing (LSI) and TREC-2. TREC, 1993.
[4] Santosh S. Vempala, et al. Latent semantic indexing: a probabilistic analysis. PODS '98, 1998.
[5] T. Landauer, et al. Indexing by Latent Semantic Analysis. 1990.
[6] Yiming Yang, et al. An application of least squares fit mapping to text information retrieval. SIGIR, 1993.
[7] C. Ding. A similarity-based probability model for latent semantic indexing. SIGIR '99, 1999.
[8] Susan T. Dumais, et al. Using linear algebra for intelligent information retrieval. 1995.
[9] David A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. SIGIR '94, 1994.
[10] Stephen I. Gallant, et al. A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks. Neural Computation, 1991.
[11] Cathy Wu, et al. Neural Networks for Full-Scale Protein Sequence Classification. 1995.
[12] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing. CL, 1999.
[13] Richard A. Harshman, et al. Information retrieval using a singular value decomposition model of latent semantic structure. SIGIR '88, 1988.
[14] Gavin W. O'Brien, et al. Information Management Tools for Updating an SVD-Encoded Indexing Scheme. 1994.