Latent Semantic Indexing

Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix. Its empirical success has so far lacked a rigorous prediction or explanation. We prove that, under certain conditions, LSI does capture the underlying semantics of the corpus and achieves improved retrieval performance. We propose random projection as a way of speeding up LSI, and we complement our theorems with encouraging experimental results. We also argue that our results can be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
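As a rough illustration of the two ingredients named above, the following Python sketch computes a rank-k LSI representation with a truncated singular value decomposition and then applies a random projection before a second, smaller decomposition. The toy matrix, the helper names (`lsi_embed`, `random_project`, `cosine`), the Gaussian projection matrix, and the values of `k` and `l` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def lsi_embed(A, k):
    """Rank-k LSI: return the top-k left singular vectors of A and the
    k-dimensional coordinates of each document (column of A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Document coordinates in the latent space: diag(s_k) @ Vt_k
    return U[:, :k], s[:k, None] * Vt[:k, :]

def random_project(A, l, seed=0):
    """Map the term space (rows of A) to l dimensions with a Gaussian
    matrix. The 1/sqrt(l) scaling preserves squared column norms in
    expectation. The Gaussian choice is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((l, A.shape[0])) / np.sqrt(l)
    return R @ A

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical term-document matrix: rows are terms, columns are documents.
# Documents 0-1 share one vocabulary, documents 2-3 share another.
A = np.array([
    [2, 0, 0, 0],  # car
    [1, 2, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 2, 1],  # flower
    [0, 0, 1, 2],  # petal
], dtype=float)

# Plain LSI on the full matrix.
U_k, docs_k = lsi_embed(A, k=2)
sim_same = cosine(docs_k[:, 0], docs_k[:, 1])  # same topic
sim_diff = cosine(docs_k[:, 0], docs_k[:, 2])  # different topics

# Random projection first, then LSI on the smaller matrix.
B = random_project(A, l=3)
_, docs_proj = lsi_embed(B, k=2)
```

On this toy matrix the two documents that share a vocabulary end up nearly parallel in the latent space, while documents from different vocabularies are nearly orthogonal. In a realistic corpus the number of terms is far larger than `l`, which is where the projection step would save work; the tiny sizes here only show the shape of the computation.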
