ABSTRACT

Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI's use of higher orders of co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.

1. INTRODUCTION

The use of co-occurrence information in textual data has led to performance improvements in a variety of applications in information retrieval, computational linguistics, and textual data mining. Many researchers in these fields have developed techniques that explicitly employ second and third order term co-occurrence. Examples include applications such as literature search [14], word sense disambiguation [12], ranking of relevant documents [15], and word selection [8]. Other authors have developed algorithms that implicitly rely on the use of term co-occurrence for applications such as search and retrieval [5], trend detection [14], and stemming [17].

In what follows we refer to various degrees of term transitivity as orders of co-occurrence: first order if two terms co-occur in some document, second order if two terms are linked only through a third, and so on. An example of second order co-occurrence follows. Assume that a collection has one document that contains the terms A and B, and a second document that contains the terms B and C. Although A and C never appear together in any document, they are linked through B, and thus share a second order co-occurrence.
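As a concrete illustration of this transitivity, the following minimal Python sketch (our own toy example, not code from any of the cited systems; the three-term collection and all variable names are illustrative assumptions) builds a tiny term-document matrix in which "apple" and "pear" never co-occur directly, and shows that the rank-k truncated term-term matrix A_k A_k^T = U_k S_k^2 U_k^T produced by LSI nevertheless assigns the pair a nonzero value:

    # Minimal sketch of second order co-occurrence under LSI truncation.
    # Toy data: "apple" and "pear" never co-occur; both co-occur with "orange".
    import numpy as np

    terms = ["apple", "orange", "pear"]
    # Term-document matrix: rows are terms, columns are documents.
    A = np.array([
        [1, 0],   # apple appears only in document 0
        [1, 1],   # orange appears in both documents
        [0, 1],   # pear appears only in document 1
    ], dtype=float)

    # First order co-occurrence: the full term-term matrix A A^T.
    first_order = A @ A.T

    # LSI: rank-k truncated SVD (here k = 1).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 1
    Uk, sk = U[:, :k], s[:k]

    # Truncated term-term matrix: A_k A_k^T = U_k S_k^2 U_k^T.
    Tk = Uk @ np.diag(sk**2) @ Uk.T

    i, j = terms.index("apple"), terms.index("pear")
    print("first order (apple, pear):", first_order[i, j])   # 0.0
    print("truncated   (apple, pear):", round(Tk[i, j], 3))  # 0.5

The first order entry for (apple, pear) is zero, while the corresponding entry of the truncated matrix is 0.5: the SVD truncation has introduced a transitive association through "orange", precisely the kind of connectivity path discussed above.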
REFERENCES

[1] Susan T. Dumais, et al. LSI meets TREC: A Status Report. TREC, 1992.
[2] Michael A. Malcolm, et al. Computer Methods for Mathematical Computations. 1977.
[3] R. E. Story, et al. An Explanation of the Effectiveness of Latent Semantic Indexing by Means of a Bayesian Regression Model. Inf. Process. Manag., 1996.
[4] W. Bruce Croft, et al. Corpus-based stemming using cooccurrence of word variants. TOIS, 1998.
[5] Bruce R. Schatz, et al. Automatic subject indexing using an associative neural network. DL '98, 1998.
[6] Yong-Bin Kim, et al. HDDI™: Hierarchical Distributed Dynamic Indexing. 2001.
[7] C. Ding. A similarity-based probability model for latent semantic indexing. SIGIR '99, 1999.
[8] Elizabeth R. Jessup, et al. Matrices, Vector Spaces, and Information Retrieval. SIAM Rev., 1999.
[9] Philip Edmonds, et al. Choosing the Word Most Typical in Context Using a Lexical Co-occurrence Network. ACL, 1997.
[10] Peter van der Weerd, et al. Conceptual Grouping in Word Co-Occurrence Networks. IJCAI, 1999.
[11] Susan T. Dumais, et al. Using Linear Algebra for Intelligent Information Retrieval. SIAM Rev., 1995.
[12] Don R. Swanson, et al. Complementary structures in disjoint science literatures. SIGIR '91, 1991.
[13] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci., 1990.
[14] Michael W. Berry, et al. SVDPACKC (Version 1.0) User's Guide. 1993.
[15] H. Schütze, et al. Dimensions of meaning. Supercomputing '92, 1992.
[16] Peter M. Wiemer-Hastings, et al. How Latent is Latent Semantic Analysis? IJCAI, 1999.