A Framework for Understanding LSI Performance

In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) in search and retrieval applications. Many models for understanding LSI have been proposed; ours is the first to study the values that LSI produces in the term dimension vectors. The framework is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the singular value decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
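As an informal illustration of the relationship described above, the following sketch builds a toy term-by-document matrix and compares co-occurrence counts with the term-term values obtained from a truncated SVD. The vocabulary, the matrix, the choice of k = 2, and all variable names are assumptions made for this example; they are not taken from the paper, and the definition of second-order co-occurrence used here (two terms that never share a document but each share one with a common third term) is a simplified working definition.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower"]
A = np.array([
    [1, 0, 0, 0],  # car        appears in doc 0
    [0, 1, 0, 0],  # automobile appears in doc 1
    [1, 1, 0, 0],  # engine     appears in docs 0 and 1
    [0, 0, 1, 1],  # flower     appears in docs 2 and 3
], dtype=float)

# First-order co-occurrence: two distinct terms share at least one document.
cooc = A @ A.T
first = cooc > 0
np.fill_diagonal(first, False)

# Second-order co-occurrence (working definition for this sketch):
# two terms are linked through a third term but never co-occur directly.
paths = first.astype(int) @ first.astype(int)
second = paths > 0
np.fill_diagonal(second, False)
second &= ~first

# Rank-k truncated SVD, the core of LSI.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
T_k = U[:, :k] * s[:k]      # term vectors scaled by singular values
term_term = T_k @ T_k.T     # LSI term-term similarity values

for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        print(f"{terms[i]:>10} / {terms[j]:<10} "
              f"first={bool(first[i, j])!s:<5} "
              f"second={bool(second[i, j])!s:<5} "
              f"lsi={term_term[i, j]:.3f}")
```

In this toy case "car" and "automobile" never appear in the same document, yet the rank-2 term-term value for the pair is about 0.5 because both co-occur with "engine", while the value for "car" and "flower", which have no co-occurrence path at all, is zero up to rounding.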
