Latent semantic indexing: a probabilistic analysis

Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
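As a rough illustration of the two techniques named above, the following is a minimal sketch, not the paper's construction. It assumes NumPy, a hypothetical six-term, four-document toy corpus, a rank-2 truncated SVD for LSI, and a Gaussian random matrix as one common form of random projection. The helper names (`lsi`, `cosine`, `random_project`) are invented for this example.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
# Documents 0-1 are about cars, documents 2-3 about astronomy.
terms = ["car", "automobile", "engine", "galaxy", "star", "planet"]
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 0],   # galaxy
    [0, 0, 1, 1],   # star
    [0, 0, 0, 1],   # planet
], dtype=float)

def lsi(A, k):
    """Return the top-k singular triplets of the term-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def random_project(A, l, rng):
    """Project the term dimension down to l with a scaled Gaussian matrix.

    This Gaussian variant is an assumption of the sketch; other random
    projection constructions exist.
    """
    R = rng.standard_normal((l, A.shape[0])) / np.sqrt(l)
    return R @ A

# Rank-2 LSI: map documents and a query into the 2-dimensional latent space.
U_k, s_k, Vt_k = lsi(A, k=2)
docs_k = U_k.T @ A                      # shape (k, number of documents)

query = np.zeros(len(terms))
query[terms.index("automobile")] = 1.0  # one-word query
query_k = U_k.T @ query

# Document 0 contains "car" and "engine" but not "automobile".
raw_sim = cosine(query, A[:, 0])        # similarity in the original term space
lsi_sim = cosine(query_k, docs_k[:, 0]) # similarity in the latent space

# Random projection to l = 4 dimensions (the toy matrix is too small for
# this to save work; it only shows the shape of the operation).
B = random_project(A, l=4, rng=np.random.default_rng(0))
```

In this toy corpus the query shares no term with document 0, so `raw_sim` is zero, while `lsi_sim` is close to one because "car" and "automobile" both co-occur with "engine". On a realistically sized matrix, one would run the SVD on the projected matrix `B` rather than on `A`.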
