Stochastic collapsed variational Bayesian inference for biterm topic model

It is useful for many applications to discover meaningful topics in short texts, such as tweets and comments on websites. Directly applying conventional topic models (e.g., LDA) to short texts often produces poor results, so the biterm topic model (BTM) was recently proposed as a general approach to short texts. However, the original BTM implementation relies on collapsed Gibbs sampling (CGS) for inference, which requires many iterations over the entire dataset. For LDA, by contrast, many fast inference algorithms have been proposed over the past decade. Among them, the recently proposed stochastic collapsed variational Bayesian inference (SCVB0) is promising because it is applicable to an online setting and exploits the collapsed representation, which yields an improved variational bound. Applying the idea of SCVB0, we develop a fast one-pass inference algorithm for BTM that can be used to analyze large-scale general short texts and is extensible to an online setting. To evaluate the proposed algorithm, we conducted several experiments on short texts from Twitter. The results show that our algorithm discovers meaningful topics significantly faster than the original algorithm.
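To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how SCVB0-style updates can be adapted to BTM: each biterm's topic responsibilities are computed in CVB0 form from expected count statistics, and those statistics are then updated stochastically with a Robbins-Monro step size in a single pass. The function name, the step-size schedule, and the corpus-scale factor `n_b` are assumptions for this sketch.

```python
import numpy as np

def scvb0_btm(biterms, V, K, alpha=1.0, beta=0.01, seed=0):
    """One-pass SCVB0-style inference sketch for the biterm topic model.

    biterms : list of (wi, wj) word-index pairs
    V, K    : vocabulary size and number of topics
    """
    rng = np.random.default_rng(seed)
    # Expected topic-word counts (randomly initialized) and topic totals.
    N_kw = rng.gamma(1.0, 1.0, size=(K, V))
    N_k = N_kw.sum(axis=1)
    n_b = len(biterms)  # corpus scale for the stochastic update (assumption)

    for t, (wi, wj) in enumerate(biterms, start=1):
        # CVB0-style responsibilities: one global topic distribution,
        # and two word factors because a biterm emits two words.
        gamma = ((N_k / 2 + alpha)
                 * (N_kw[:, wi] + beta) / (N_k + V * beta)
                 * (N_kw[:, wj] + beta) / (N_k + 1 + V * beta))
        gamma /= gamma.sum()

        # Robbins-Monro step size, then a stochastic update of the
        # expected count statistics (one pass, no full-dataset sweep).
        rho = 1.0 / (t + 10) ** 0.7
        N_kw *= (1.0 - rho)
        N_kw[:, wi] += rho * n_b * gamma
        N_kw[:, wj] += rho * n_b * gamma
        N_k = N_kw.sum(axis=1)

    # Posterior-mean point estimates of the topic-word and topic distributions.
    phi = (N_kw + beta) / (N_k[:, None] + V * beta)
    theta = (N_k / 2 + alpha) / (N_k.sum() / 2 + K * alpha)
    return theta, phi
```

Because each biterm is visited once and only the expected count statistics are kept, memory and per-biterm cost are O(KV) and O(K) respectively, which is what makes a one-pass or online setting feasible.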
