The primary purpose of an information retrieval (IR) system is to retrieve all documents that are relevant to the user query. The ad hoc document retrieval task based on Latent Semantic Indexing/Analysis (LSI/LSA) investigates the performance of retrieval systems that search a static set of documents using new questions. Others have tested the performance of LSI on several smaller datasets (e.g., MED and CISI abstracts); however, LSI has not been tested on a large dataset. We therefore tested LSI on a very large dataset, the TREC-8 LA Times collection. We applied three different term weighting schemes (raw term frequency, log-entropy, and tf-idf) together with our own stop word list to judge performance. Recall-precision graphs and the Coefficient of Variation (CV) were used to evaluate the retrieval performance of the LSI-based retrieval system. We found that the tf-idf term weighting scheme performs better than the log-entropy and raw term frequency weighting schemes when the test collection is very large.
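To make the compared weighting schemes concrete, the following is a minimal sketch using common textbook definitions of tf-idf and log-entropy weighting, plus the coefficient of variation. The exact formula variants used in the experiments are not given here, so the function names, the natural-log idf, and the population standard deviation are all assumptions made for illustration. The input is a hypothetical term-by-document matrix of raw counts (rows are terms, columns are documents); the raw term frequency scheme corresponds to using that matrix unchanged.

```python
import math
from statistics import mean, pstdev

def tf_idf(counts):
    """Weight raw counts by tf * log(N / df); assumed textbook variant."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)          # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted

def log_entropy(counts):
    """Local weight log(1 + tf) times global weight 1 + sum(p * log p) / log N."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        gf = sum(row)                              # total occurrences of the term
        entropy = 0.0
        for c in row:
            if c > 0:
                p = c / gf
                entropy += p * math.log(p)
        g = 1.0 + entropy / math.log(n_docs)       # 1 for a term concentrated in one document, 0 for a uniform term
        weighted.append([math.log(1 + c) * g for c in row])
    return weighted

def coefficient_of_variation(values):
    """Standard deviation divided by mean (population form, an assumption)."""
    return pstdev(values) / mean(values)

# Hypothetical toy matrix: 3 terms x 3 documents.
counts = [
    [2, 0, 0],   # term concentrated in one document
    [1, 1, 1],   # term spread evenly over all documents
    [0, 3, 1],
]
tfidf_matrix = tf_idf(counts)
logent_matrix = log_entropy(counts)
```

In both schemes a term that occurs evenly in every document receives a weight of zero (up to rounding), while a term confined to a single document keeps its full local weight. In an LSI pipeline the weighted matrix would then be passed to a truncated singular value decomposition, which this sketch omits.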