Detecting Patterns in the LSI Term-Term Matrix

Higher-order co-occurrences play a key role in the effectiveness of systems used for text mining. A wide variety of applications use techniques that explicitly or implicitly employ a limited degree of transitivity in the co-occurrence relation. In this work we show that the Singular Value Decomposition (SVD) algorithm makes use of higher orders of co-occurrence and, by inference, so do the systems that rely on SVD, such as Latent Semantic Indexing (LSI). Our empirical and mathematical studies prove that term co-occurrence plays a crucial role in LSI. This work is the first to study the values produced in the truncated term-term matrix, and we have discovered an explanation for why certain term pairs receive a high similarity value while others receive low (and even negative) values. Thus we have discovered the basis for a claim that is frequently made for LSI: that it emphasizes important semantic distinctions (latent semantics) while reducing noise in the data.
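To make the truncated term-term matrix concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' experimental setup: the three-term, two-document matrix is invented toy data, the function name truncated_term_term is hypothetical, and the term-term matrix is taken to be T_k S_k^2 T_k^T, one common way of deriving term-term similarities from a rank-k SVD. In the toy data, terms 0 and 2 never share a document, but both co-occur with term 1, so they are linked only by a second-order co-occurrence.

```python
import numpy as np


def truncated_term_term(A, k):
    """Rank-k term-term matrix T_k S_k^2 T_k^T from the SVD of a term-document matrix A.

    Hypothetical helper for illustration; rows of A are terms, columns are documents.
    """
    T, s, _ = np.linalg.svd(A, full_matrices=False)
    Tk = T[:, :k]                      # first k left singular vectors (term space)
    return Tk @ np.diag(s[:k] ** 2) @ Tk.T


# Toy term-document matrix (assumed data): 3 terms x 2 documents.
# Document 0 contains terms 0 and 1; document 1 contains terms 1 and 2.
A = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

full = A @ A.T                         # untruncated term-term matrix
reduced = truncated_term_term(A, k=1)  # rank-1 truncation

print("direct co-occurrence of terms 0 and 2:", full[0, 2])
print("rank-1 similarity of terms 0 and 2:  ", reduced[0, 2])
```

In this toy case the untruncated entry for terms 0 and 2 is zero, while the rank-1 entry is positive, which is the kind of transitive effect the abstract attributes to SVD. Real collections are far larger and use weighted term frequencies, so this only illustrates the mechanism.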
