RPLSA: A novel updating scheme for Probabilistic Latent Semantic Analysis

A novel updating method for Probabilistic Latent Semantic Analysis (PLSA), called Recursive PLSA (RPLSA), is proposed. The update rules for the conditional probabilities are derived from first principles for both the asymmetric and the symmetric PLSA formulations. For both formulations, the performance of RPLSA is compared to that of PLSA folding-in, PLSA rerun from the breakpoint, and well-known LSA updating methods, namely singular value decomposition (SVD) folding-in and SVD-updating. The experimental results demonstrate that RPLSA outperforms the other updating methods under study on two criteria: it attains a higher average log-likelihood, and a lower average absolute error between the probabilities estimated by the updating method and those obtained by applying non-adaptive PLSA from scratch. A comparison in terms of CPU run time is conducted as well. Finally, in document clustering evaluated with the Adjusted Rand index, the clusters generated by RPLSA are shown to be (a) similar to those generated by PLSA applied from scratch and (b) closer to the ground truth than those created by the other PLSA or LSA updating methods.
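For orientation, the PLSA folding-in baseline named above can be sketched as follows. This is not the RPLSA update proposed in the paper; it is a minimal illustration of standard folding-in for the asymmetric formulation, in which the trained term distributions P(w|z) are kept fixed and only the topic mixture P(z|d) of a new document is estimated by EM. The function names, the toy interface, and the normalization of the log-likelihood by the total term count are assumptions made for this sketch.

```python
import math


def fold_in(counts, p_w_given_z, n_iter=50):
    """Estimate P(z|d_new) for one new document with P(w|z) held fixed.

    counts      -- term counts n(d_new, w), one entry per vocabulary term
    p_w_given_z -- K lists, each a distribution over the vocabulary
    (Hypothetical helper; names are assumptions, not the paper's notation.)
    """
    n_topics = len(p_w_given_z)
    p_z = [1.0 / n_topics] * n_topics  # uniform start for P(z|d_new)
    for _ in range(n_iter):
        expected = [0.0] * n_topics
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # E-step: posterior P(z|d_new, w) is proportional to P(w|z) P(z|d_new)
            joint = [p_w_given_z[z][w] * p_z[z] for z in range(n_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue
            for z in range(n_topics):
                expected[z] += n * joint[z] / norm
        total = sum(expected)
        if total == 0.0:
            break  # no term of the document is explained by any topic
        # M-step: renormalize expected counts into P(z|d_new)
        p_z = [v / total for v in expected]
    return p_z


def avg_log_likelihood(counts, p_w_given_z, p_z):
    """Per-term average of log P(w|d) under the mixture sum_z P(w|z) P(z|d).

    Dividing by the total term count is an assumption of this sketch.
    """
    n_topics = len(p_w_given_z)
    total = sum(counts)
    ll = 0.0
    for w, n in enumerate(counts):
        if n == 0:
            continue
        p_w = sum(p_w_given_z[z][w] * p_z[z] for z in range(n_topics))
        ll += n * math.log(p_w)
    return ll / total


# Toy model: two topics over a three-term vocabulary.
toy_p_w_given_z = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
toy_counts = [8, 1, 1]  # new document dominated by the first term
toy_p_z = fold_in(toy_counts, toy_p_w_given_z)
```

Because folding-in never revises P(w|z), new documents cannot change the learned topics, which is the limitation that updating schemes such as the one proposed here aim to address.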
