Two time-efficient Gibbs sampling inference algorithms for biterm topic model

The Biterm Topic Model (BTM) is an effective topic model for short texts. However, its standard Gibbs sampling inference algorithm (StdBTM) is considerably more time-consuming than the corresponding algorithm (StdLDA) for Latent Dirichlet Allocation (LDA). To solve this problem, in this paper we propose two time-efficient Gibbs sampling inference algorithms for BTM, SparseBTM and ESparseBTM, which trade space for time. The idea of SparseBTM is to reduce the computation in StdBTM by recycling intermediate results and exploiting the sparsity of the count matrix $\mathbf{N}_{\mathbf{W}}^{\mathbf{T}}$. Theoretically, SparseBTM reduces the time complexity of StdBTM from $O(|B|K)$ to $O(|B|K_w)$, which scales linearly with the sparsity of $\mathbf{N}_{\mathbf{W}}^{\mathbf{T}}$ ($K_w$, the average number of non-zero topics per word type in $\mathbf{N}_{\mathbf{W}}^{\mathbf{T}}$, with $K_w < K$) rather than with the number of topics $K$. Experimental results show that under favorable conditions SparseBTM is approximately 18 times faster than StdBTM. ESparseBTM is an even more time-efficient Gibbs sampling inference algorithm built on SparseBTM. Its idea is to reduce computation further by recycling more intermediate results through rearranging the biterm sequence. In theory, ESparseBTM reduces the time complexity of SparseBTM from $O(|B|K_w)$ to $O(R|B|K_w)$, where $R$ ($0 < R < 1$) is the ratio of the number of biterm types to the number of biterms. Experimental results show that ESparseBTM improves the time efficiency of SparseBTM by between 6.4% and 39.5%, depending on the dataset.

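To make the complexity claims concrete, the sketch below (not the authors' code) shows one collapsed Gibbs update for a single biterm $b = (w_1, w_2)$. The dense variant mirrors the standard BTM full conditional, which touches all $K$ topics per biterm; the sparse variant illustrates the kind of decomposition the abstract suggests, splitting the product $(n_{w_1|k}+\beta)(n_{w_2|k}+\beta)$ into a cheap smoothing part and a correction that is non-zero only for the roughly $K_w$ topics where either word has a non-zero count in $\mathbf{N}_{\mathbf{W}}^{\mathbf{T}}$. The exact factorization, the function names, and the NumPy data layout are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a SparseLDA-style bucket decomposition applied to the
# BTM full conditional. The paper's actual SparseBTM algorithm may differ in the
# exact factorization and in how intermediate results are cached and recycled.
import numpy as np

def sample_topic_std(w1, w2, n_wt, n_t, alpha, beta):
    """Dense StdBTM-style update: O(K) work for every biterm.

    n_wt : (W, K) word-topic count matrix N_W^T
    n_t  : (K,)   number of biterms currently assigned to each topic
    (Counts for the biterm's current assignment are assumed to be decremented
    before calling and re-incremented for the sampled topic afterwards.)
    """
    K, W = n_t.shape[0], n_wt.shape[0]
    p = (n_t + alpha) * (n_wt[w1] + beta) * (n_wt[w2] + beta) / (n_t + W * beta) ** 2
    p /= p.sum()
    return int(np.random.choice(K, p=p))

def sample_topic_sparse(w1, w2, n_wt, n_t, alpha, beta):
    """Sparsity-aware sketch: per-biterm work limited to ~K_w topics."""
    K, W = n_t.shape[0], n_wt.shape[0]
    denom = (n_t + W * beta) ** 2
    # Smoothing bucket: the weight every topic would get if both word counts
    # were zero. It depends only on n_t, so it can be cached and recycled.
    smooth = (n_t + alpha) * beta * beta / denom
    # Sparse bucket: the correction term, non-zero only for topics where
    # w1 or w2 has a non-zero count in N_W^T (about K_w topics on average).
    nz = np.union1d(np.nonzero(n_wt[w1])[0], np.nonzero(n_wt[w2])[0])
    sparse = (n_t[nz] + alpha) * (
        (n_wt[w1, nz] + beta) * (n_wt[w2, nz] + beta) - beta * beta
    ) / denom[nz]
    # Pick a bucket, then a topic within it (inverse-CDF sampling).
    sparse_mass, smooth_mass = sparse.sum(), smooth.sum()
    u = np.random.rand() * (sparse_mass + smooth_mass)
    if u < sparse_mass:
        return int(nz[np.searchsorted(np.cumsum(sparse), u)])
    u -= sparse_mass
    return int(min(np.searchsorted(np.cumsum(smooth), u), K - 1))
```

In the same spirit, the biterm rearrangement behind ESparseBTM can be pictured as grouping the biterm list so that identical $(w_1, w_2)$ pairs are processed consecutively and can share the intermediate quantities computed above; this is where the factor $R$, the fraction of distinct biterm types, enters the $O(R|B|K_w)$ bound.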