Paper information - Latent semantic indexing and large dataset: Study of term-weighting schemes

The primary purpose of an information retrieval (IR) system is to retrieve all the documents that are relevant to the user query. The ad hoc document retrieval task, here based on Latent Semantic Indexing/Analysis (LSI/LSA), investigates the performance of retrieval systems that search a static set of documents using new questions. Others have tested the performance of LSI on several smaller datasets (e.g. MED, CISI abstracts); however, LSI has not been tested on a large dataset. We therefore tested LSI on a very large dataset, the TREC-8 LA Times collection. We applied three different term-weighting schemes and our own stop word list to judge the performance. A recall-precision graph and the Coefficient of Variation (CV) were used to evaluate the retrieval performance of the LSI-based retrieval system. We found that the tf-idf term-weighting scheme performs better than the log-entropy and raw term frequency weighting schemes when the test collection becomes very large.
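As an illustration only, the three term-weighting schemes named in the abstract can be sketched as below. The abstract does not state the exact formulas, so the definitions used here are common textbook variants and should be read as assumptions: tf-idf as `tf * ln(N / df)`, and log-entropy as `ln(1 + tf)` multiplied by a global entropy weight. The function names and the toy `counts` matrix (rows are terms, columns are documents) are hypothetical and not taken from the paper.

```python
import math


def raw_tf(counts):
    """Raw term frequency: return a copy of the counts, unweighted."""
    return [row[:] for row in counts]


def tf_idf(counts):
    """tf-idf, assuming the variant w = tf * ln(N / df)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        # Document frequency: number of documents containing the term.
        df = sum(1 for tf in row if tf > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([tf * idf for tf in row])
    return weighted


def log_entropy(counts):
    """Log-entropy, assuming w = ln(1 + tf) * (1 + sum(p * ln p) / ln N)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        gf = sum(row)  # Global frequency of the term across the collection.
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p)
        # The global weight lies in [0, 1]; terms spread evenly over all
        # documents get a weight near 0, terms in one document get 1.
        global_w = 1.0 + entropy / math.log(n_docs) if n_docs > 1 and gf else 0.0
        weighted.append([math.log(1 + tf) * global_w for tf in row])
    return weighted


# Toy term-document matrix (hypothetical): 3 terms, 3 documents.
counts = [
    [2, 0, 1],
    [1, 1, 1],
    [0, 3, 0],
]

for name, scheme in [("raw", raw_tf), ("tf-idf", tf_idf), ("log-entropy", log_entropy)]:
    print(name, scheme(counts))
```

In an LSI pipeline, a weighted matrix of this kind would typically be passed to a truncated singular value decomposition. That step is omitted here to keep the sketch free of third-party dependencies.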

A. N. K. Zaman | Charles Grant Brown
