Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[3]  Geoffrey C. Fox,et al.  A deterministic annealing approach to clustering , 1990, Pattern Recognit. Lett..

[4]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[5]  Susan T. Dumais,et al.  Personalized information delivery: an analysis of information filtering methods , 1992, CACM.

[6]  Radford M. Neal A new view of the EM algorithm that justifies incremental and other variants , 1993 .

[7]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[8]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[9]  Fernando Pereira,et al.  Aggregate and mixed-order Markov models for statistical language processing , 1997, EMNLP.

[10]  Michael I. Jordan,et al.  Unsupervised Learning from Dyadic Data , 1998 .

[11]  Jerome R. Bellegarda,et al.  Exploiting both local and global constraints for multi-span statistical language modeling , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[12]  Thomas Hofmann,et al.  Learning from Dyadic Data , 1998, NIPS.

[13]  Daniel Jurafsky,et al.  Towards better integration of semantic predictors in statistical language modeling , 1998, ICSLP.

[14]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.