ABSTRACT

Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI's use of higher orders of co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.

1. INTRODUCTION

The use of co-occurrence information in textual data has led to improvements in performance when applied to a variety of applications in information retrieval, computational linguistics and textual data mining. Furthermore, many researchers in these fields have developed techniques that explicitly employ second and third order term co-occurrence. Examples include applications such as literature search [14], word sense disambiguation [12], ranking of relevant documents [15], and word selection [8]. Other authors have developed algorithms that implicitly rely on the use of term co-occurrence for applications such as search and retrieval [5], trend detection [14], and stemming [17].

In what follows we refer to various degrees of term transitivity as orders of co-occurrence – first order if two terms co-occur, second order if two terms are linked only by a third, etc. An example of second order co-occurrence follows. Assume that a collection has one document that contains the terms

REFERENCES

Susan T. Dumais, et al. LSI meets TREC: A Status Report.
Michael A. Malcolm, et al. Computer methods for mathematical computations.
R. E. Story, et al. An Explanation of the Effectiveness of Latent Semantic Indexing by Means of a Bayesian Regression Model. Inf. Process. Manag.
W. Bruce Croft, et al. Corpus-based stemming using cooccurrence of word variants.
Bruce R. Schatz, et al. Automatic subject indexing using an associative neural network. DL '98.
Yong-Bin Kim, et al. HDDI™: Hierarchical Distributed Dynamic Indexing.
C. Ding. A similarity-based probability model for latent semantic indexing. SIGIR '99.
Elizabeth R. Jessup, et al. Matrices, Vector Spaces, and Information Retrieval. SIAM Rev.
Philip Edmonds, et al. Choosing the Word Most Typical in Context Using a Lexical Co-occurrence Network.
Peter van der Weerd, et al. Conceptual Grouping in Word Co-Occurrence Networks.
Susan T. Dumais, et al. Using Linear Algebra for Intelligent Information Retrieval. SIAM Rev.
Don R. Swanson, et al. Complementary structures in disjoint science literatures. SIGIR '91.
Richard A. Harshman, et al. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci.
Michael W. Berry, et al. SVDPACKC (Version 1.0) User's Guide.
H. Schütze, et al. Dimensions of meaning. Supercomputing '92.
Peter M. Wiemer-Hastings, et al. How Latent is Latent Semantic Analysis?