Understanding LSI Via The Truncated Term-Term Matrix

In this thesis, we study the relation between Latent Semantic Indexing (LSI) and the co-occurrence of terms in document collections. LSI is a method for automatic indexing and retrieval based on the vector space model; it represents documents and computes relevance scores in a reduced, topic-related space. For our study, we view LSI as a document expansion method: for a pair of terms, the occurrence of one of them in a document increases or decreases the importance of the other term for that document, depending on the corresponding entry in the expansion matrix. We study the relation between this expansion matrix and the co-occurrence information of term pairs in collections, and find that the entries of the expansion matrix are influenced by the order of co-occurrence of the term pairs. We then show that the retrieval performance of LSI with the optimal choice of parameters can be obtained when the expansion matrix used is a simple linear combination of the first- and second-order co-occurrences.

I hereby declare under oath that I have written this Diplomarbeit independently, used only the cited sources, and have not submitted it to any other examination office.

Saarbrücken, 31 May 2005
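The relationship sketched above can be illustrated on a toy collection. The snippet below is a minimal sketch, not the thesis's actual experimental setup: it builds a small term-document matrix, forms the rank-k truncated term-term (expansion) matrix from the SVD, and fits scalar weights for a linear combination of the first- and second-order co-occurrence matrices by least squares. The matrix entries and the fitted weights are purely illustrative.

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Rank-k truncated SVD: A ~ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], np.diag(s[:k])

# Truncated term-term ("expansion") matrix: entry (i, j) tells how an
# occurrence of term j in a document raises or lowers term i's weight.
T_k = U_k @ S_k @ S_k @ U_k.T          # equals A_k @ A_k.T

# First- and second-order co-occurrence matrices of the collection:
# C1 counts shared documents; C2 links terms via shared co-occurring terms.
C1 = A @ A.T
C2 = C1 @ C1

# Roughly the thesis's claim: T_k is approximated by a * C1 + b * C2
# for suitable scalars a, b (found here by least squares over all
# matrix entries; not the thesis's fitted values).
X = np.stack([C1.ravel(), C2.ravel()], axis=1)
(a, b), *_ = np.linalg.lstsq(X, T_k.ravel(), rcond=None)
T_approx = a * C1 + b * C2
```

How well `T_approx` matches `T_k` depends on the collection and on k; the point of the sketch is only the construction of the three matrices being compared.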
