A Mathematical View of Latent Semantic Indexing: Tracing Term Co-occurrences

ABSTRACT

Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI's use of higher orders of term co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth-order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.

1. INTRODUCTION

The use of co-occurrence information in textual data has led to improvements in performance when applied to a variety of applications in information retrieval, computational linguistics, and textual data mining. Furthermore, many researchers in these fields have developed techniques that explicitly employ second- and third-order term co-occurrence. Examples include applications such as literature search [14], word sense disambiguation [12], ranking of relevant documents [15], and word selection [8]. Other authors have developed algorithms that implicitly rely on the use of term co-occurrence for applications such as search and retrieval [5], trend detection [14], and stemming [17].

In what follows we refer to various degrees of term transitivity as orders of co-occurrence: first order if two terms co-occur in a document, second order if two terms are linked only through a third term, and so on. An example of second-order co-occurrence follows. Assume that a collection has one document that contains the terms A and B, and a second document that contains the terms B and C. Terms A and C then co-occur at second order, since they are linked only through the shared term B.

REFERENCES

[1] Susan T. Dumais et al. LSI meets TREC: A Status Report. TREC, 1992.
[2] Michael A. Malcolm et al. Computer methods for mathematical computations. 1977.
[3] R. E. Story et al. An Explanation of the Effectiveness of Latent Semantic Indexing by Means of a Bayesian Regression Model. Inf. Process. Manag., 1996.
[4] W. Bruce Croft et al. Corpus-based stemming using cooccurrence of word variants. TOIS, 1998.
[5] Bruce R. Schatz et al. Automatic subject indexing using an associative neural network. DL '98, 1998.
[6] Yong-Bin Kim et al. HDDI™: Hierarchical Distributed Dynamic Indexing. 2001.
[7] C. Ding. A similarity-based probability model for latent semantic indexing. SIGIR '99, 1999.
[8] Elizabeth R. Jessup et al. Matrices, Vector Spaces, and Information Retrieval. SIAM Rev., 1999.
[9] Philip Edmonds et al. Choosing the Word Most Typical in Context Using a Lexical Co-occurrence Network. ACL, 1997.
[10] Peter van der Weerd et al. Conceptual Grouping in Word Co-Occurrence Networks. IJCAI, 1999.
[11] Susan T. Dumais et al. Using Linear Algebra for Intelligent Information Retrieval. SIAM Rev., 1995.
[12] Don R. Swanson et al. Complementary structures in disjoint science literatures. SIGIR '91, 1991.
[13] Richard A. Harshman et al. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci., 1990.
[14] Michael W. Berry et al. SVDPACKC (Version 1.0) User's Guide. 1993.
[15] H. Schütze et al. Dimensions of meaning. Supercomputing '92, 1992.
[16] Peter M. Wiemer-Hastings et al. How Latent is Latent Semantic Analysis? IJCAI, 1999.