Optimize collapsed Gibbs sampling for biterm topic model by alias method

With the popularity of social networks such as microblogs and Twitter, topic inference for short text is increasingly significant and essential for many content analysis tasks. The biterm topic model (BTM) is superior to conventional topic models in uncovering latent semantic relevance in short text. However, the Gibbs sampling employed by BTM is very time consuming when inferring topics, especially on large-scale datasets: it requires O(K) operations per sample, where K is the number of topics in the corpus. In this paper, we propose FastBTM, an accelerated inference algorithm for BTM that requires only O(1) amortized time per sample, whereas traditional samplers scale linearly with the number of topics. FastBTM is based on the Metropolis-Hastings algorithm and the alias method, both of which have been widely adopted for the latent Dirichlet allocation (LDA) model and have achieved outstanding speedups there. We carry out a number of experiments on the Tweets2011 collection and the Enron dataset, indicating that our method is robust for both short texts and normal documents. Per iteration, our method is approximately 9 times faster than the traditional Gibbs sampling method when K = 1000. The source code of FastBTM can be obtained from https://github.com/paperstudy/FastBTM.
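To make the complexity claim concrete, the sketch below (hypothetical names, not the authors' code) illustrates the two ingredients the abstract names: Vose's alias method, which preprocesses a discrete distribution in O(K) so that each subsequent draw costs O(1), and a Metropolis-Hastings acceptance step, which corrects for sampling from a stale alias table instead of the current conditional distribution.

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(K) preprocessing of a normalized distribution."""
    K = len(probs)
    scaled = [p * K for p in probs]          # rescale so the average bucket mass is 1
    prob = [0.0] * K                          # probability of keeping bucket i
    alias = [0] * K                           # fallback topic for bucket i
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]          # move the deficit of s onto l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # leftovers are (numerically) full buckets
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """O(1) draw: pick a uniform bucket, then flip one biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

def mh_step(current, proposal, true_w, stale_q, rng=random):
    """Accept a proposal drawn from a stale distribution stale_q so that the
    chain still targets the true (unnormalized) weights true_w."""
    a = (true_w[proposal] * stale_q[current]) / (true_w[current] * stale_q[proposal])
    return proposal if rng.random() < a else current
```

Because the alias table is rebuilt only occasionally (its O(K) cost amortized over many draws) and each draw plus acceptance test is constant time, the per-sample cost is O(1) amortized, which is the source of the speedup the abstract describes.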
