A user-oriented semi-supervised probabilistic topic model

Topic modeling has been widely used to mine topics. However, users' individual needs are seldom considered, even though personalization is becoming increasingly important. In this work, we propose a user-oriented probabilistic topic model based on Latent Dirichlet Allocation. Words that do or do not interest a user serve as supervised information that captures the user's preferences. We also present a self-learning algorithm that effectively increases the quantity of supervised information. Because the model is semi-supervised, it treats data with and without attached supervised information differently. For parameter inference, we integrate the Pólya urn model into the Gibbs sampling process so that the different kinds of supervised information are used efficiently. Experiments on real datasets show that the model outperforms state-of-the-art models.
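To make the inference idea concrete, the following is a minimal sketch of collapsed Gibbs sampling for LDA with a generalized Pólya urn style count update. It is not the authors' model: the function name `gibbs_lda_gpu`, the `related` map (word id to the word ids it promotes), and the `promo` weight are hypothetical names chosen for illustration, and the self-learning step and the handling of uninteresting words are omitted.

```python
import random


def gibbs_lda_gpu(docs, vocab_size, num_topics, related=None, promo=0.3,
                  alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with a Polya-urn-style promotion.

    docs    : list of documents, each a list of integer word ids
    related : hypothetical map {word id: [promoted word ids]}; an assumption
              standing in for user-supplied "interested" word groups
    promo   : hypothetical fractional count added to each promoted word
    """
    rng = random.Random(seed)
    related = related or {}
    n_dk = [[0.0] * num_topics for _ in docs]                 # doc-topic counts
    n_kw = [[0.0] * vocab_size for _ in range(num_topics)]    # topic-word counts
    n_k = [0.0] * num_topics                                  # topic totals

    def update(d, w, k, sign):
        # Simple urn: the sampled word itself changes the counts by one.
        n_dk[d][k] += sign
        n_kw[k][w] += sign
        n_k[k] += sign
        # Generalized urn: related words also gain (or lose) a fraction.
        for v in related.get(w, ()):
            n_kw[k][v] += sign * promo
            n_k[k] += sign * promo

    # Random initialization of topic assignments.
    z = []
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            update(d, w, k, +1)
        z.append(assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)   # remove the current assignment
                weights = [
                    (n_dk[d][k] + alpha)
                    * (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                update(d, w, k, +1)         # add the new assignment
    return z, n_kw
```

With an empty `related` map the sketch reduces to ordinary collapsed Gibbs sampling for LDA. A non-empty map lets each assignment of a word also raise the topic-word counts of its related words, which is one common way a Pólya urn scheme injects word-level prior knowledge into sampling.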
