FastBTM: Reducing the sampling time for biterm topic model

Abstract Due to the popularity of social networks such as microblogs and Twitter, a vast amount of short text data is created every day. Research on short text, such as topic inference, has therefore become increasingly important. The biterm topic model (BTM) exploits the word co-occurrence patterns of the whole corpus, which makes it better than conventional topic models at uncovering latent semantic relevance in short text. However, BTM relies on Gibbs sampling to infer topics, which is very time consuming, especially for large-scale datasets or when the number of topics is extremely large: each sample requires O(K) operations, where K denotes the number of topics in the corpus. In this paper, we propose FastBTM, an acceleration algorithm for BTM that uses a more efficient sampling method and converges much faster than BTM without degrading topic quality. FastBTM is based on Metropolis-Hastings and the alias method, both of which have been widely adopted for the Latent Dirichlet Allocation (LDA) model and have achieved outstanding speedups. FastBTM reduces the sampling complexity of the biterm topic model from O(K) to O(1) amortized time. We carry out experiments on three datasets: two short text datasets, the Tweets2011 Collection and Yahoo! Answers, and one long document dataset, Enron. Our experimental results show that the gap in running time between FastBTM and BTM widens as the number of topics K increases. In addition, FastBTM is effective for both short text and long document datasets.
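
The abstract names two building blocks, the alias method and Metropolis-Hastings, without showing how they combine to give O(1) amortized sampling. The following Python code is a minimal sketch of that general idea, not the FastBTM implementation itself. It assumes Vose's variant of the alias method and an independence proposal drawn from a stale copy of the topic weights; the names `build_alias_table`, `alias_draw`, `mh_step`, `target_weight`, and `stale_weights` are hypothetical and chosen only for illustration.

```python
import random


def build_alias_table(weights):
    """Vose's alias method: O(K) setup so that each later draw costs O(1)."""
    k = len(weights)
    total = float(sum(weights))
    scaled = [w * k / total for w in weights]  # mean of scaled weights is 1
    prob = [0.0] * k
    alias = [0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # keep s with this probability
        alias[s] = l          # otherwise redirect the draw to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:   # leftovers fill their bucket completely
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def mh_step(current, target_weight, stale_weights, prob, alias, rng=random):
    """One Metropolis-Hastings step with an independence proposal.

    `target_weight(k)` is assumed to return the up-to-date unnormalized
    probability of topic k; `stale_weights` are the (possibly outdated)
    weights the alias table was built from.
    """
    proposal = alias_draw(prob, alias, rng)
    # Acceptance ratio p(new) q(old) / (p(old) q(new)).
    num = target_weight(proposal) * stale_weights[current]
    den = target_weight(current) * stale_weights[proposal]
    if den == 0 or rng.random() < min(1.0, num / den):
        return proposal
    return current
```

In a sampler built along these lines, the alias table would be rebuilt only occasionally (an O(K) cost spread over many draws), while each individual draw and acceptance test touches a constant number of entries. The Metropolis-Hastings correction is what compensates for the table being out of date.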
