Modeling hidden topics on document manifold

Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing, which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document on the hidden topics independently and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation (LDA) is proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both of these two methods discover the hidden topics in the Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. Therefore, it is more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs the topic modeling on this document manifold in question. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides better representation in the sense of semantic structure.

[1]  Xiaojin Zhu,et al.  Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning , 2005, ICML.

[2]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[3]  Geoffrey E. Hinton,et al.  A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[4]  Mikhail Belkin,et al.  Manifold Regularization : A Geometric Framework for Learning from Examples , 2004 .

[5]  Chris H. Q. Ding,et al.  Spectral Relaxation for K-means Clustering , 2001, NIPS.

[6]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[7]  Sandhya Dwarkadas,et al.  On scaling latent semantic indexing for large peer-to-peer systems , 2004, SIGIR '04.

[8]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[9]  William H. Press,et al.  The Art of Scientific Computing Second Edition , 1998 .

[10]  David M. Pennock,et al.  Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments , 2001, UAI.

[11]  W. Press,et al.  Numerical Recipes in Fortran: The Art of Scientific Computing.@@@Numerical Recipes in C: The Art of Scientific Computing. , 1994 .

[12]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[13]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[14]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Luo Si,et al.  Adjusting Mixture Weights of Gaussian Mixture Model via Regularized Probabilistic Latent Semantic Analysis , 2005, PAKDD.

[16]  Mikhail Belkin,et al.  Problems of learning on manifolds , 2003 .

[17]  Jiawei Han,et al.  Document clustering using locality preserving indexing , 2005, IEEE Transactions on Knowledge and Data Engineering.

[18]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[19]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[20]  Zheng Chen,et al.  Latent semantic analysis for multiple-type interrelated data objects , 2006, SIGIR.

[21]  Wei-Ying Ma,et al.  Locality preserving indexing for document representation , 2004, SIGIR '04.

[22]  Yousef Saad,et al.  Polynomial filtering in latent semantic indexing for information retrieval , 2004, SIGIR '04.

[23]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[24]  Mikhail Belkin,et al.  Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples , 2006, J. Mach. Learn. Res..

[25]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[26]  Rie Kubota Ando Latent semantic space: iterative scaling improves precision of inter-document similarity measurement , 2000, SIGIR '00.

[27]  Xi Chen,et al.  Text classification with kernels on the multinomial manifold , 2005, SIGIR '05.