Latent semantic indexing and large dataset: Study of term-weighting schemes

The primary purpose of an information retrieval (IR) system is to retrieve all documents that are relevant to the user's query. The ad hoc document retrieval task, here based on Latent Semantic Indexing/Analysis (LSI/LSA), investigates the performance of retrieval systems that search a static set of documents using new questions. Others have tested the performance of LSI on several smaller datasets (e.g., MED, CISI abstracts); however, LSI has not been tested on a large dataset. We therefore tested LSI on a very large dataset, the TREC-8 LA Times collection. We applied three different term-weighting schemes (raw term frequency, log-entropy, and tf-idf) together with our own stop word list to judge the performance. Recall-precision graphs and the Coefficient of Variation (CV) were used to evaluate the retrieval performance of the LSI-based retrieval system. We found that the tf-idf term-weighting scheme performs better than the log-entropy and raw term frequency weighting schemes when the test collection becomes very large.
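To make the three weighting schemes concrete, the following is a minimal sketch in Python with NumPy. It assumes common textbook forms of tf-idf and log-entropy weighting, which may differ in detail from the variants used in the experiments, and the function names (`raw_tf`, `tf_idf`, `log_entropy`, `lsi_rank`) are hypothetical. The dense SVD is for illustration only; a collection the size of TREC-8 LA Times would need a sparse truncated solver.

```python
import numpy as np

def raw_tf(A):
    """Raw term frequency: use the counts unchanged."""
    return A.astype(float)

def tf_idf(A):
    """tf-idf, assuming idf_i = log(N / df_i) (one common variant)."""
    A = A.astype(float)
    n_docs = A.shape[1]
    df = np.count_nonzero(A, axis=1)          # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))  # guard against unused terms
    return A * idf[:, None]

def log_entropy(A):
    """Log-entropy, assuming local weight log(1 + tf) and global weight
    1 + sum_j p_ij * log(p_ij) / log(N). Requires more than one document."""
    A = A.astype(float)
    n_docs = A.shape[1]
    gf = A.sum(axis=1, keepdims=True)         # global frequency of each term
    p = np.divide(A, gf, out=np.zeros_like(A), where=gf > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(A) * g[:, None]

def lsi_rank(W, q, k):
    """Rank documents against query vector q in a rank-k LSI space.
    W is the weighted term-by-document matrix; k must not exceed its rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, sk, docs = U[:, :k], s[:k], Vt[:k, :].T   # rows of docs = documents
    q_hat = (q @ Uk) / sk                          # fold the query in
    sims = docs @ q_hat / (
        np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat) + 1e-12
    )
    return np.argsort(-sims)                       # best match first

# Toy term-by-document count matrix (4 terms, 4 documents), illustrative only.
A = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 4, 0]])
query = np.array([0.0, 0.0, 0.0, 1.0])  # query containing only the last term

for scheme in (raw_tf, tf_idf, log_entropy):
    print(scheme.__name__, lsi_rank(scheme(A), query, k=2))
```

In this sketch, a term that occurs equally often in every document (the third row of the toy matrix) receives zero weight under both tf-idf and log-entropy, whereas raw term frequency keeps it at full strength.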