Self-organizing weighted incremental probabilistic latent semantic analysis

Probabilistic Latent Semantic Analysis (PLSA) is a popular topic modeling technique that has been widely applied in text mining to discover the underlying topics in a data corpus. However, because data grow and change over time, topics must be discovered dynamically and large datasets must be processed incrementally. Moreover, PLSA models have difficulty with inference on new documents. To overcome these problems, we propose a novel Weighted Incremental PLSA algorithm, called WIPLSA, that dynamically discovers topics and incrementally learns them from new documents. Experiments verify that WIPLSA captures the dynamic topics hidden in a dynamically updated corpus. Compared with PLSA, MAP PLSA and QB PLSA, WIPLSA achieves better perplexity on large datasets, which makes it applicable to big data mining. In addition, WIPLSA performs well in document categorization.
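
For orientation, the following is a minimal sketch of the standard batch PLSA baseline fitted by EM, together with the perplexity measure used for comparison. It is not the WIPLSA algorithm, whose weighting and incremental update rules are not given in this abstract. The function names `plsa_em` and `perplexity`, the toy count matrix, and the small smoothing constant are illustrative assumptions.

```python
import math
import random


def plsa_em(counts, n_topics, n_iter=50, seed=0, eps=1e-12):
    """Fit standard (batch) PLSA by EM on a docs-by-words count matrix.

    Returns P(z|d) as a docs-by-topics list and P(w|z) as a
    topics-by-words list. `eps` is an assumed smoothing constant that
    only guards against division by zero.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row) + eps * len(row)
        return [(x + eps) / total for x in row]

    # Random initialization of both conditional distributions.
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                # Accumulate expected counts n(d,w) * P(z|d,w) for the M-step.
                for z in range(n_topics):
                    r = n_dw * joint[z] / total
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z


def perplexity(counts, p_z_d, p_w_z):
    """exp(-log-likelihood / token count); lower means a better fit."""
    log_lik, n_tokens = 0.0, 0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p_w_d = sum(p_z_d[d][z] * p_w_z[z][w]
                            for z in range(len(p_w_z)))
                log_lik += n_dw * math.log(p_w_d)
                n_tokens += n_dw
    return math.exp(-log_lik / n_tokens)


if __name__ == "__main__":
    # Toy corpus (assumed): two documents per vocabulary block.
    toy = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
    p_z_d, p_w_z = plsa_em(toy, n_topics=2)
    print("training perplexity:", perplexity(toy, p_z_d, p_w_z))
```

Because batch EM of this kind must revisit the whole corpus whenever documents arrive, and P(z|d) is defined only for documents seen during training, the sketch also illustrates the two limitations that motivate an incremental variant.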
