A Mathematical View of Latent Semantic Indexing: Tracing Term Co-occurrences

ABSTRACT

Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI's use of higher orders of term co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth-order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.

1. INTRODUCTION

The use of co-occurrence information in textual data has led to improvements in performance when applied to a variety of applications in information retrieval, computational linguistics, and textual data mining. Furthermore, many researchers in these fields have developed techniques that explicitly employ second- and third-order term co-occurrence. Examples include applications such as literature search [14], word sense disambiguation [12], ranking of relevant documents [15], and word selection [8]. Other authors have developed algorithms that implicitly rely on the use of term co-occurrence for applications such as search and retrieval [5], trend detection [14], and stemming [17].

In what follows we refer to various degrees of term transitivity as orders of co-occurrence: first order if two terms co-occur in a document, second order if two terms are linked only through a third term, and so on. An example of second-order co-occurrence follows. Assume that a collection has one document that contains the terms A and B, and a second document that contains the terms B and C. Terms A and C then co-occur at second order, since they are linked only through the shared term B.

REFERENCES

[1] Susan T. Dumais et al. LSI meets TREC: A Status Report. TREC, 1992.
[2] Michael A. Malcolm et al. Computer methods for mathematical computations. 1977.
[3] R. E. Story et al. An Explanation of the Effectiveness of Latent Semantic Indexing by Means of a Bayesian Regression Model. Inf. Process. Manag., 1996.
[4] W. Bruce Croft et al. Corpus-based stemming using cooccurrence of word variants. TOIS, 1998.
[5] Bruce R. Schatz et al. Automatic subject indexing using an associative neural network. DL '98, 1998.
[6] Yong-Bin Kim et al. HDDI™: Hierarchical Distributed Dynamic Indexing. 2001.
[7] C. Ding. A similarity-based probability model for latent semantic indexing. SIGIR '99, 1999.
[8] Elizabeth R. Jessup et al. Matrices, Vector Spaces, and Information Retrieval. SIAM Rev., 1999.
[9] Philip Edmonds et al. Choosing the Word Most Typical in Context Using a Lexical Co-occurrence Network. ACL, 1997.
[10] Peter van der Weerd et al. Conceptual Grouping in Word Co-Occurrence Networks. IJCAI, 1999.
[11] Susan T. Dumais et al. Using Linear Algebra for Intelligent Information Retrieval. SIAM Rev., 1995.
[12] Don R. Swanson et al. Complementary structures in disjoint science literatures. SIGIR '91, 1991.
[13] Richard A. Harshman et al. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci., 1990.
[14] Michael W. Berry et al. SVDPACKC (Version 1.0) User's Guide. 1993.
[15] H. Schütze et al. Dimensions of meaning. Supercomputing '92, 1992.
[16] Peter M. Wiemer-Hastings et al. How Latent is Latent Semantic Analysis? IJCAI, 1999.